CHAPTER ONE


INTRODUCTION AND SUMMARY 


1.1 INTRODUCTION 

Burroughs Corporation is pleased to present this report, which is the result
of work carried on under an extension to contract No. NAS2-9456, a preliminary
study for a Numerical Aerodynamic Simulation Facility. The primary objective
of this extension is to produce an optimized functional design of key elements
of the candidate facility defined in the Final Report of the basic contract.

This is accomplished by effort in the following tasks: 

• To further develop, optimize and document the functional description
of the custom hardware.

• To delineate trade-off areas between performance, reliability,
availability, serviceability and programmability.

• To develop metrics and models for validation of the candidate
system's performance.

• To conduct a functional simulation of the system design.

• To perform a reliability analysis of the system design.

• To develop the software specifications, to include a user-level
high-level programming language and a correspondence between the
programming language and the instruction set, and to outline the
operating system requirements.

The results of this effort are presented in five separate chapters:

Chapter 2. Functional Description includes a summary of the system
parameters, block diagrams, descriptions of the major elements and
the instruction set with detailed timing.

Chapter 3. Software Issues describes the extensions and restrictions
on the FORTRAN language and compiler at the functional level, a
discussion of converting statements in extended FORTRAN into machine
language, and a statement regarding the operating system.

Chapter 4. Simulations presents the models, metrics and methodology
for conducting the simulation along with preliminary results.

Chapter 5. Reliability includes two sections. The first presents the
results of an availability analysis of the system and the second presents
further discussion of the error detection, correction and control to be
employed.

Chapter 6. Trade-offs delineates and discusses a large number of
design and operating factors for which reasonable alternatives exist.

While the information in this report is designed to stand alone, it is also
considered to be a supplement to the Final Report (Ref. 2) of the basic
NAS2-9456 contract. Where appropriate, reference is made to that report
rather than unnecessarily repeating previously reported information.

In addition, it should be pointed out that certain terminology used in the
previous report has been revised. The new terms are:

• Flow Model Processor (FMP). This is the portion of the system
previously called the Navier-Stokes Solver (NSS).

• Processor Data Memory (PDM), previously called Processing
Element Memory (PEM).

• Processor Program Memory (PPM), previously called
Processing Element Program Memory (PEPM).

• Execution Unit (EU), the logic portion of the array processor,
formerly called Processor Element (PE).

The following sections summarize the chapters in additional detail. 

1.2 FUNCTIONAL DESIGN 

The FMP is an array processor of 512 processors, a control unit, and 521
modules of extended memory, as described in Reference 1. The major additions
found in Chapter 2 to the description of reference 1 are: first, the
provision of SECDED, instead of parity-plus-retry, as the expected means of
error control in the processors' memory; second, the addition of four on-line
spare processors as definitely a part of the design (they are mentioned briefly
as a possibility in reference 1); third, significant revisions and additions to
the instruction set; fourth, the restriction of the extended memory instructions
to fetching 512 words (one per processor) per instruction (the earlier
description had EM instructions fetching 512 x N words per instruction); and
fifth, provision for special hardware for computing any floating-point
variables that are not members of a vector.

Chapter 2 includes diagrams and figures of every element of the FMP.

1.3 SOFTWARE


The software chapter covers the FORTRAN language to a depth necessary to
cover simple test cases, discusses hand compiling, and reports on progress
in defining the operating system during this contract extension. Three and
only three extensions are envisioned for the initial FORTRAN language.
First, the DOALL construct declares to the compiler that the iterations of
a particular loop can be done in any sequence, or all in parallel, without
affecting the result; second, declarations of several types of use of
variables are used to allocate those variables among the different types
of memory; third, certain system library functions are required, because
of the parallel nature of the machine, that would not be required in
serial FORTRAN. None of these library functions is required for the
initial benchmarks.
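
The property that DOALL asserts can be illustrated outside FORTRAN. The following is a minimal Python sketch, not FMP FORTRAN; the function names and the sample update are hypothetical. Each iteration touches only its own element, so the result is the same whatever order the iterations run in.

```python
import random

def update(i, a, b):
    # Each iteration reads and writes only element i, so no iteration
    # depends on any other. This is the condition DOALL declares.
    b[i] = 2.0 * a[i] + 1.0

def run_in_order(a, order):
    # Execute the loop body once per index, in the order supplied.
    b = [0.0] * len(a)
    for i in order:
        update(i, a, b)
    return b

a = [float(i) for i in range(8)]
forward = run_in_order(a, range(len(a)))

shuffled_order = list(range(len(a)))
random.shuffle(shuffled_order)
shuffled = run_in_order(a, shuffled_order)

print(forward == shuffled)
```

A loop in which iteration i read a value written by iteration i-1 would fail this test for most orderings, and so could not be declared DOALL.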

The operating system is extensively described in reference 1. The level of 
detail in that document is such that the effort of the contract extension was
spent more fruitfully on language definition, compiler considerations, and hand 
compilation procedures. Thus, the operating system discussion in reference 1 
still stands as the best description so far produced of the operating system of 
the FMP. No attempt has been made to update that description for this report. 

1.4 SIMULATION 

Chapter 4 discusses the separation of the simulation effort into two levels:
the instruction and FMP level, and the system level. Metrics for each level
are discussed, and the selection of SUBROUTINE TURBDA as the metric for the
simulation done in this extension is also given. The BOSS simulator, in which
our simulation is being done, is described briefly in chapter 4.

1.5 RELIABILITY


A detailed computer model for the reliability of the FMP was run. The results
of this model bound the availability between 96 percent, the lower limit
under pessimistic assumptions, and better than 99 percent, achieved under
the most optimistic assumptions. The use
of spare processors with operating system automatic restart (assumed success- 
ful for some fraction of all attempts) produces a very significant improvement 
over the model that has no spare processors. 

The reliability section also includes a discussion of the use of SECDED in all
memory, of the process of "scrubbing" out the errors that spontaneously arise in
CCD storage (DBM), and of other error control stratagems that are used in
the FMP.

1.6 TRADEOFFS 

Chapter 6 discusses tradeoffs in many areas. These include ease of
programming versus execution efficiency, where one wishes to have most of
both; word and instruction formats; error control methods versus their cost
in reduced throughput; several specific design issues; relative speeds of
specific blocks of the system; alternate methods of supplying the
floating-point scalar capability; and other topics. A final section covers
the expansibility both of the specific FMP, once built, and of the design
from which it was built.

CHAPTER 2


FUNCTIONAL DESCRIPTION OF NSS HARDWARE 


2.1 INTRODUCTION 


This functional description is arranged in several successive 
sections. First, a brief system description of the SAM that is 
the baseline system for FMP is given. Second, a brief list of 
system parameters is provided. Third, the elements of the system 
block diagram are each described in turn. Fourth, the instruction 
set of the FMP is given, together with its timings.

In all of this, it has not been felt necessary to repeat material 
that is found in the final report of contract NAS2-9456, except 
very briefly to refresh the reader's recollection. It is pre- 
sumed that the reader has first read that report. 

No design should be considered to be necessarily final if further 
investigation should show that the machine performs better with 
the feature modified. Chapter 6, "Tradeoffs", is a discussion of 
many of the features that will be studied in simulation during 
phase 2 (time permitting), and which are therefore likely to be 
modified in the direction of higher throughput if the baseline 
system is found wanting. 

This functional description is intended to provide the base for 
the information input to a performance simulation of the SAM of 
the FMP. Some of the information, such as error correction
capabilities, is included for completeness in spite of the fact that
it has no apparent involvement in a performance simulation.

2.2 BASIC SYSTEM PARAMETERS


Most of the basic system parameters were covered in some detail in
the final report (Ref. 1). They are summarized here along with
additional information of specific interest.

2.2.1 Logic Family - ECL is the preferred logic family. Final
selection of circuits for implementation at this time would only
lock us into choices that will become obsolete by 1979-1980 when
the design is completed. We do not wish to preclude the use of
up-to-date technology in the actual design. If the final design
were being implemented at this time, Fairchild's 100K series would
be chosen, together with compatible memory circuits. The chip
count projected for 1979-1980 is the one assigned to the baseline
system. Confidence in this package count is supported in most
cases by the very similar chip counts of circuit types already
available in 1977 (usually ECL 100K), which are also given.

2.2.2 Clock Rate - The clock has been assigned a 40 ns period.
The instruction times, given below in terms of this clock period,
are compatible with the instruction times derived from a
preliminary processor design using ECL 100K.

2.2.3 Cabling Methods - The same flat belts used successfully in
prior projects in Burroughs for transmitting high-speed signals
with fast rise time and low crosstalk will be used for most of the
interunit cables. Reference 1 discusses this choice.

2.2.4 Power - While a number of comments on power were included
in reference 1, certain detailed information was not. These 
details are provided in the following statements. 

• Switching regulators will be used for the sake of efficiency.
A net efficiency of 65% is expected from the total power supply.

• DBM is provided with whatever power is required to make it
nonvolatile against glitches and short power outages. Since
CCD is proposed for DBM, battery backup would be highly
desirable.

• The ground return from backplane to power supply is never
used as part of the path that connects one backplane ground
to another backplane ground. Figure 2-1 shows the grounding
arrangements expected.

• Total power for the FMP is estimated (very approximately)
at 250 kW, based on an average of 0.8 W for each of the
200,000 circuit packages, and 65% efficiency in the power
supply. These are for the 1980 projected circuit counts.

• Every module has its signal ground tied to chassis so that
there will be no floating grounds when the modules are
tested as stand-alone modules. In Figure 2-1 these ties
are shown as resistors.
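
The total power estimate above can be checked with simple arithmetic. The sketch below uses only the three figures quoted in the text (package count, average dissipation per package, supply efficiency); the variable names are illustrative.

```python
PACKAGES = 200_000         # circuit packages, 1980 projection (from the text)
WATTS_PER_PACKAGE = 0.8    # average dissipation per package, in watts
SUPPLY_EFFICIENCY = 0.65   # net efficiency of the total power supply

# Power delivered to the logic and memory packages.
load_w = PACKAGES * WATTS_PER_PACKAGE

# Power drawn from the line, after dividing by supply efficiency.
input_w = load_w / SUPPLY_EFFICIENCY

print(f"package load: {load_w / 1e3:.0f} kW")   # 160 kW
print(f"line input:   {input_w / 1e3:.0f} kW")  # 246 kW
```

The result, about 246 kW, is consistent with the "very approximately 250 kW" quoted above.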

A requirement on power supplies employed at NASA Ames is that they
must ride through the undervoltage transients produced by wind
tunnel motor startup, and not pass voltage spikes. In addition,
they should be reasonably respectful to the source. Three power
supply configurations satisfy this requirement.

• Motor-generator set. Inertia enables an M-G set to ride
through large transients. The inefficiency of the M-G set
is multiplied into the inefficiency of the system power
supplies. The advantage of an M-G set is that it can be added
to a system after the fact, without impacting any existing
design.

• Transformerless rectifiers, like the old AC-DC radio,
require a filter capacitor, which suppresses spikes and, if
large enough, will ride through undervoltage transients.
The unregulated DC (about 280 V) is distributed around the
equipment and used as input to individual switching
regulators. SCR rectifiers are to be avoided, since they
inject noise back into the line.

• Battery back-up Uninterruptible Power Supply (UPS).

Of the three schemes, the transformerless rectifier is most 
efficient, and takes the least space. It also has the advantage 
that back-up batteries can be supplied to a selected subset of the 
equipment (DBM, in this case). It is also easy to make the 
rectification redundant. Three-phase full wave rectifiers are 
actually six-phase for ripple characteristics. They often need no 
chokes, and have wide conduction angles in the rectifier diodes. 
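
The "about 280 V" and "six-phase" figures can be related by a rough calculation. The sketch below ASSUMES a 208 V (line-to-line RMS), 60 Hz three-phase source; the report states neither value, so both are illustrative. It uses the standard ideal result for a three-phase full-wave bridge, average output = (3 * sqrt(2) / pi) * V_LL, with diode drops and loading ignored.

```python
import math

V_LL_RMS = 208.0   # ASSUMED line-to-line RMS voltage (not stated in the report)
LINE_HZ = 60.0     # ASSUMED line frequency (not stated in the report)

# Peak of the line-to-line waveform; a lightly loaded filter capacitor
# charges toward this value.
v_peak = math.sqrt(2) * V_LL_RMS

# Average output of an ideal three-phase full-wave bridge.
v_avg = 3 * math.sqrt(2) / math.pi * V_LL_RMS

# Six conduction pulses per line cycle give ripple at six times line frequency.
ripple_hz = 6 * LINE_HZ

print(f"peak {v_peak:.0f} V, average {v_avg:.0f} V, ripple {ripple_hz:.0f} Hz")
```

Under these assumptions the average comes out near 281 V, in line with the unregulated DC level quoted above, and the ripple fundamental at six times the line frequency is why little or no choke filtering is needed.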

2.2.5 Number of Processors - A key decision in the design of the
FMP is the choice of the number of processors to be implemented.
The design presented here is based on using the fastest processor
that is consistent with the speed of memory built of 16k-bit
static RAM chips. Projecting 100 ns speed for such chips, we
arrive at a 360 ns floating point multiply as being approximately
in balance. A faster processor would yield increased speed only
if the memory were changed to the faster 4k-bit chips, implying a
four-fold increase in the number of components in memory.
Reliability, even more than cost, tells us to keep the parts count
down, and therefore to design a system consistent with 16k-bit
memory chips. It takes about 512 processors, at these speeds, to
yield the desired billion floating point operations per second with
sufficient margin for inefficiencies.
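
The sizing argument can be sketched numerically. This rough estimate ASSUMES each processor completes one floating-point operation per 360 ns multiply time, with no overlap and no other overheads; the real instruction mix is given by the timings later in this chapter. The names are illustrative.

```python
CLOCK_NS = 40        # clock period, from section 2.2.2
MULTIPLY_NS = 360    # floating point multiply time, from the text
PROCESSORS = 512     # processors in the array
TARGET_FLOPS = 1e9   # the desired billion operations per second

# A multiply occupies a whole number of clock periods.
clocks_per_multiply = MULTIPLY_NS // CLOCK_NS

# Upper bound on array throughput under the one-op-per-multiply-time assumption.
peak_flops = PROCESSORS / (MULTIPLY_NS * 1e-9)

# Headroom over the target that is left to absorb inefficiencies.
margin = peak_flops / TARGET_FLOPS

print(clocks_per_multiply)           # 9
print(f"{peak_flops:.3e} ops/s")     # 1.422e+09 ops/s
print(f"margin factor {margin:.2f}") # margin factor 1.42
```

On this estimate the array has roughly 40 percent headroom over one billion operations per second.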

2.3 OVERVIEW OF FUNCTIONAL DESCRIPTION 

2.3.1 Block Diagram 

Figure 2-2 (a slightly expanded copy of Figure 1-2 of Ref. 1)
shows the array processor, consisting mostly of 512 processors
attached by a switch, the Transposition Network, to 521 Extended
Memory modules which hold the main data base of the program. Data
Base Memory is used as a staging area for jobs not yet started, and
as the output area for jobs in process or completed. A Control
Unit synchronizes the action and controls the transposition
network and the transfers in and out on both faces of the extended
memory. The controller for the Data Base Memory also accepts
requests from the host processor to transfer to and from the host
disk pack file system. The Data Base Memory controller resolves
access conflicts to and from data base memory. The Control Unit
resolves accesses to and from Extended Memory. There is also a
Diagnostic Controller used for maintenance and cold starts.

Each processor is self-contained, with integer and floating-point
arithmetic units, its own instruction decoder, its own program
memory, and its own data memory. In addition to the 512
processors, four processors are included as on-line spares to help
achieve system availability requirements. The use of these
on-line spare processors is discussed in Chapter Five.

2.3.2 Instruction Streams 

As described in Ref. 1, the FMP is controlled by two instruction
streams, which are created in parallel by the compiler from a
single sequence of source statements. One instruction stream is
being executed in the control unit; the other is being executed by
all processors asynchronously of each other. Some statements in
the source code result in instructions in both instruction
streams. Examples are "CALL subroutine", or an arithmetic
statement using an EM variable, and therefore requiring a fetch to
all processors from the EM. Some of these joint instructions
require that the control unit and the processors synchronize
themselves. It has been observed that reference 1 does not seem
to be clear in explaining synchronization, nor in explicating the
means of accomplishing it. Therefore, the discussion digresses
here to a detailed discussion of the synchronization mechanism.

2.3.3 Synchronization


The process of synchronization occurs within instructions. It
involves two signal lines which go from the control unit to all
processors, namely "CUready" and "go". "CUready" is a level; "go"
is a pulse that arrives at all processors simultaneously. From
each processor there are two lines: "enabled" is a copy of the
"enabled" flipflop that exists in each processor; "I got here" is
a signal, a level, which is raised during the execution of some
instructions.

To explain the process, consider the example of a LOADEM instruc- 
tion fetching N words from EM. In the control unit, the LOADEM 
causes the raising of the "CUready" line as soon as the TN 
controls have been set to the proper value. In each processor 
where "enabled" is true, "I got here" is raised as soon as the 
processor starts executing the LOADEM instruction. 

When any processor executing LOADEM sees "CUready" true, the 
processor sends the address through the TN to the EM module that 
is connected to this processor. The strobe accompanying the 
address causes the loading of the address within the EM module. 

An "all processors ready" signal, marking the time at which the
last enabled processor arrives at the LOADEM instruction, is
created for the CU. (The logic creating this signal is actually
contained within the fanout tree.) Using En as the "enable" bit
of the nth processor, and Hn as the "I got here" line of the nth
processor, the "all processors ready" signal is given by the
formula

All-processors-ready = (H1 OR NOT E1) AND (H2 OR NOT E2) AND ...
                       AND (H512 OR NOT E512)

That is, every processor must either have raised "I got here" or
not be enabled. There is also "any processor enabled", the OR of
all the "enable" bits.
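
The two array-wide signals can be expressed as a short sketch. The Python below is illustrative only: the function and argument names are hypothetical, and the real signals are formed combinationally in the fanout tree rather than computed in software. It assumes that a processor whose enable bit is off is not waited for.

```python
def all_processors_ready(enabled, got_here):
    # True when every processor has either raised "I got here"
    # or is not enabled (and so is not waited for).
    return all(h or not e for e, h in zip(enabled, got_here))

def any_processor_enabled(enabled):
    # The OR of all the "enable" bits.
    return any(enabled)

# Three-processor example: processor 2 is disabled and never arrives.
enabled = [True, True, False]
print(all_processors_ready(enabled, [True, False, False]))  # False
print(all_processors_ready(enabled, [True, True, False]))   # True
print(any_processor_enabled(enabled))                       # True
```

Note that when no processor is enabled the first signal is trivially true, which is one reason a separate "any processor enabled" signal is useful.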
When the CU sees "all processors ready", the CU issues, after an
appropriate delay to let addresses be loaded, a series of N "read"
commands to the EM module and also issues, appropriately timed
with respect to the last such command, a "go" pulse to the
processors. In the processor, we load N words under control of
the N strobes coming from the EM module through the TN. The "go"
signals the end of the instruction.

As a second example, consider the instruction WAIT. Here no 
processor action timed to the "CUready" is required, so the CU 
sends no "CUready". When the CU sees the "all processors ready" 
signal formed from the "I got here"s and the "enable"s, it issues 
a "go" to all processors, who have refrained from executing their 
next instruction until the "go" is received. 

When the processor has raised its "I got here" line, but before it
has received a "go" signal, it is said to be "waiting". The "I
got here" line is dropped upon receipt of the "go" pulse.
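
The WAIT example behaves like a barrier. The following is a minimal cycle-by-cycle sketch, with hypothetical names, in which each enabled processor raises "I got here" at a given cycle and the CU issues "go" at the first cycle on which "all processors ready" is true. Propagation delay through the fanout tree is ignored.

```python
def cycle_of_go(arrival_cycle, enabled):
    # arrival_cycle[p] is the cycle at which processor p reaches WAIT.
    # Returns the cycle at which the CU would issue "go".
    count = len(enabled)
    got_here = [False] * count
    cycle = 0
    while True:
        # Enabled processors that have reached WAIT raise "I got here"
        # and then wait.
        for p in range(count):
            if enabled[p] and arrival_cycle[p] <= cycle:
                got_here[p] = True
        # CU side: "all processors ready" ignores disabled processors.
        if all(h or not e for e, h in zip(enabled, got_here)):
            return cycle  # "go" is issued; waiting processors proceed
        cycle += 1

# Processor 2 is disabled, so its late arrival does not hold up the array.
print(cycle_of_go([3, 7, 100], [True, True, False]))  # 7
```

The "go" time is set by the slowest enabled processor, which is the behavior the text describes for WAIT.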

In addition to the above synchronization, the CU also has the 
power to transmit commands. The commands are carried on a 
4-bit-wide bus accompanied by a strobe line. Many of these 
commands are used in the diagnostic programs. Ref. 1, p 4-27, has 
a tentative list of operations called forth by these commands. 

Some of these commands will be conditional on the "enable" bit of 
the processor, some are unconditional independent of the enable 
bit. The only such command that is used in user-generated FORTRAN 
programs is the command that simultaneously loads the program 
counter and sets the enable bit. 

The control unit's command power is exerted over all processors at
once, not over individual processors. Processors that do not join
in some array-wide operation avoid it by a) jumping around the
operation, if it is local to each processor, b) executing certain
instructions (LOADEM, STOREM, SHIFTN) as no-ops conditional on the
last bit of an integer register in the processor, or c) executing
the STOP instruction, which turns off the "enable" bit until the
CU reaches some point in its instruction stream that turns it back
on.


There is also an interrupt line from processor to CU. 

2.3.4 Starting a Run 

During normal operation, all data and program for the next run 
will be loaded into data base memory prior to the beginning of the 
run. When the run starts, system software in the CU loads program 
from data base memory to the memory of the control unit (via
extended memory). The initialization phase of the program then
transfers necessary data to extended memory, and transmits the
processors' program to them. These actions are automatically
inserted by the compiler and the linker. With data in place in
extended memory, and allocated space initialized to "invalid", and
with code files in place in control unit and processors, user
execution starts.

2.3.5 FMP Hardware Summary

The Flow Model Processor therefore consists of

• One Control Unit (CU) with its own memory (CUM), with
optional scalar processor capability

• 512 Processors (plus 4 spares), each with its own
Processor Data Memory (PDM) and Processor Program Memory
(PPM)

• One Transposition Network

• 521 Extended Memory modules

• One Data Base Memory and Controller

• One Diagnostic Controller


All of the above is shown in Figure 2-2 except for the optional 
scalar processor and the four spare processors. The scalar pro- 
cessor is an ingredient of the design which was not needed in 
order to successfully match the SAM to the aerodynamic flow 
models. Since the scalar processor was not discussed in reference 
1, further discussion thereon is found in Chapter 6. 

2.4 INDIVIDUAL BLOCKS 

Following is a brief description of each of the elements of the
FMP, together with a formatted tabulation of pertinent features and
a block diagram of each.

2.4.1 Description of Tables

For each element of the FMP, there is a table of characteristics
given. A very short narrative description gives the intended
function of the element in user programs. Source of control is
identified, and the storage capabilities, both capacity and speed,
are also given. Connectivity to other elements is broken down to
a rather detailed level, with each group of signals that has an
identifiably different function being so identified. In some
cases, such as CU to processor, signals in the same belt are
identified as a different group in order to more clearly identify
their use.

The table also discusses the mode of error control built into the
design. Some mechanisms of error control were included in the
baseline system design in the final report. Some further
mechanisms of error control are proposed in Chapter 5. This
section represents a particular state of the design, not the final
state.

Two chip counts are given. The 1979-1980 projected chip count is 
the one projected for the baseline system. The second chip count, 
using parts now existing in 1977, is given only for corroboration, 
to indicate the reasonableness of that projection. It also 
represents the chip count of the FMP if design were frozen now. 
There are also in some cases estimates of the power drain. All
these are preliminary and are included only for interest. They
have no direct bearing on the performance evaluation simulation.

"TBD" means "to be determined". 

2.4.2 Processor - The array of 512 processors is charged with the
task of executing the user computations in the program, namely the
floating-point operations on the problem variables.

The processor executes code contained in its own program memory, 
and accepts commands from the control unit. Certain instructions
(see Table 2-13) are executed in synchronism with the control unit 
(and hence, by implication, in synchronism with the entire array, 
since the control unit expects cooperation from all processors.) 

The actions of the processor are delineated by the instruction set
in the next section. Figure 2-3 shows pictorially the division of
the processor into an execution unit, a data memory, and a
program memory. Figure 2-4 is a block diagram of the logic part
of the processor, showing the independent integer and floating
point units, with separate register files for each. Figure 2-5 is
a diagram of the instruction fetching and overlap machinery, which
is explained at length below in connection with the timing of
instruction execution. The logic portion of the processor has been
named the "execution unit." Table 2-1 provides data on the EU.

Connections to the processor come from the control unit and the
transposition network. A byte-wide (8-bit) data path is found
both from (BDCST) and to (HVST) the control unit. The
synchronization signals discussed previously also come from the
control unit. The 4-bit wide command path, and its strobe, also
come from the control unit. The data paths to (STOREM) and from
(LOADEM) the transposition network are each accompanied by a
strobe. In addition, each processor is connected to backplane
wiring that expresses its own number. Of the 129 processors in a
cabinet, any one may be the spare processor. Suppose processor
no. N is the spare processor. Then the backplane number for
processors 0 through N-1 is correct, but the backplane number for
processors N+1 through 128 must be shifted down by one, to N through
127, in order that the processors being used by the program be
consecutively numbered. Therefore, there is a one-bit signal
coming from the switching machinery which tells the processor
whether or not to subtract 1 from its hard-wired processor number
to correct for the location of the spare.
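
The renumbering rule can be stated compactly. In the sketch below the function and argument names are hypothetical; the comparison stands in for the one-bit signal supplied by the switching machinery, and the spare is given no program-visible number.

```python
def logical_number(backplane_no, spare_no):
    # Map a hard-wired backplane number (0..128) to the number
    # seen by the program, given which position holds the spare.
    if backplane_no == spare_no:
        return None  # the spare takes no part in the program
    # One-bit correction: processors wired above the spare subtract 1.
    subtract_one = backplane_no > spare_no
    return backplane_no - 1 if subtract_one else backplane_no

# With the spare at position 5, the other 128 processors in the
# cabinet are numbered consecutively 0 through 127.
numbers = [logical_number(b, 5) for b in range(129) if b != 5]
print(numbers == list(range(128)))  # True
```

Because the correction is a single bit per processor, switching in a different spare changes only which processors see the bit asserted, not the backplane wiring itself.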

Error control within the processor consists of bounds checks, 
reasonableness checks, and consistency checks, as listed in 
Ref. 1. See Sections 6.7 and 6.8 for further checks that may be 
implemented but at some cost in throughput. 

For justification of the 1977 component count, see appendix E of 
volume II of reference 1. 
Figure 2-4. Internal Block Diagram of EU

Figure 2-5. Instruction Fetching and Overlap

TABLE 2-1

EXECUTION UNIT CHARACTERISTICS

UNIT: Execution Unit (EU)        No. in System: 512 + 4 on-line spares

FUNCTIONAL CHARACTERISTICS

Function: This is the logic portion of the processor, all the processor
except memory. It executes code that has been written by the FMP FORTRAN
compiler, including EM address computations, index calculations and
floating point operations.

Source of Control: During User Program: Program stored in PPM, syncs
                   from the CU.
                   During System Startup and Diagnostics: Same, plus CU
                   commands.

Storages: Capacity: 16 16-bit integer registers
                    16 48-bit floating point registers
                    Other registers (see text)
          Speed:    Multiple accesses each 40 ns clock

Connectivity to Other Elements:

 #   Path      To or From           No. Sig.   Timing       Primary Use
 1   BDCST     From CU              8          byte/20ns    Receive global variables from CU
 2   HVST      To CU                8          byte/20ns    Transmit result to CU (global)
 3   LOADEM    From TN              9          byte/20ns    Receive data from EM
 4   STOREM    To TN                9          byte/20ns    Transmit data to EM
 5   CUinstr   From CU              4          TBD          Primarily for diagnostics
 6   sync      To CU                4          edge         Synchronization
 7   sync      From CU              4          edge         Synchronization
 8   PEno      Wired to backplane   9          D.C. level   Processor's own number

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: TBD. Modulo 3 check on arithmetic is being
evaluated. Error cases are detected (see text).

Repair Methods: Replace and restart from restart point. On-line
replacement (with manual pull-and-replace at a later convenience of the
repairman) is very feasible.

MTBF of Unit: See Chapter 5.

Degraded Modes Available: Programs can be compiled to use less than all
the processors available, thereby bypassing any failed processors.
On-line switching of spare processors.

PHYSICAL

Chip Count:    1980 Projection: 100
               If use 1977 parts: 160 (100K ECL etc.)
               (based on preliminary logic design using 100K)

Physical Size: 1980: One large pc. sized module.
               1977: Single removable module

Power Drain:   1980: 150 W        1977: 300 W

2.4.3 Processor Data Memory - The processor data memory (PDM)
contains work space for each processor. It is also used to hold
local copies of global information, to facilitate their being
fetched by the processor's program. It can be used to window data
from EM. Control is from the memory address register in the
processor. There are 16384 words of 55 bits, consisting of 48
bits of data and 7 bits of single-error-correcting,
double-error-detecting code. Data, address, and control connections
are solely to the processor. 16k-bit static RAM chips are used.
Figure 2-6 shows some of the logic in the processor associated with
the port into PDM. Table 2-2 describes major characteristics of the
PDM. See sections 6.6, 6.12, 6.13 for discussion of tradeoffs in PDM
design.
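
The 7 check bits for a 48-bit data word are what a conventional SECDED construction would need. The sketch below ASSUMES an extended Hamming code (a single-error-correcting Hamming code plus one overall parity bit); the report does not say which code is intended, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    # Single-error correction needs the smallest r such that
    # 2**r >= data_bits + r + 1 (one syndrome per bit position,
    # plus one for "no error").
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1

check = secded_check_bits(48)
print(check)        # 7
print(48 + check)   # 55 bits per stored word
```

The same rule gives 8 check bits for a 64-bit word, the familiar (72,64) arrangement, which suggests the 55-bit word here is the minimum for 48 data bits under this kind of code.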


2.4.4 Processor Program Memory - Processor Program Memory (PPM)
contains the code file from which the processor executes. It is
addressed directly by the program counter. Overlay comes from the
CU via the "broadcast" (BDCST) path. Except for the size of 8192
words, design is identical with that of PDM.
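
The memory-chip counts quoted in Tables 2-2 and 2-3 follow from the word counts and the 55-bit word. The sketch below is a simple capacity division, with hypothetical names; it ignores how individual chips are organized (bits wide by words deep), which the report does not specify here.

```python
import math

WORD_BITS = 55  # 48 data bits + 7 SECDED bits

def memory_chips(words, chip_bits):
    # Smallest number of chips whose total capacity holds the memory.
    return math.ceil(words * WORD_BITS / chip_bits)

K = 1024
print(memory_chips(16384, 16 * K))  # PDM, 16k-bit chips: 55
print(memory_chips(16384, 4 * K))   # PDM, 4k-bit chips: 220
print(memory_chips(8192, 16 * K))   # PPM, 16k-bit chips: 28
print(memory_chips(8192, 4 * K))    # PPM, 4k-bit chips: 110
```

These agree with the memory portions of the 1980 and 1977 chip counts in the tables, and show the four-fold parts-count penalty of the 4k-bit chips.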

2.4.5 Control Unit (CU) 

2.4.5.1 Basic Control Unit

The control unit, during user programs, is in charge of
synchronizing the array for those instructions that require a
synchronized array; it issues the "go" signal. It also handles those
portions of the address computation that must be issued from a 
central point. The control unit executes the FMP-resident portion 
of the system software. It has a single shared memory (CUM) for 
both program and data. 
TABLE 2-2

CHARACTERISTICS OF PROCESSOR DATA MEMORY

UNIT: Processor Data Memory (PDM)    No. in System: 512 + 4 spares with
      (formerly processing element                  spare processor
      memory, PEM)

FUNCTIONAL CHARACTERISTICS

Function: Stores temporary variables generated by the processor during
computation. Work space. Subroutine return information. Windows EM data.

Source of Control: During User Program: EU command lines
                   During System Startup and Diagnostics: Same

Storages: Capacity: 16,384 words
          Speed:    120 ns cycle

Connectivity to Other Elements:

 #   Path      To or From    No. Sig.   Timing           Primary Use
 1   data      To/from EU    55         static           Fetch and store data
 2   address   From EU       16         static           Address
 3   control   From EU       2          edge or static   Command

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED

Repair Method: Removed with entire processor. Not a separate entity.

MTBF of Unit: Dominated by control chips because of SECDED.

Degraded Modes Available: Programs compiled to less than 512 processors
bypass failed PEM's. Error correction allows program to continue, but
with reduced reliability, in single-bit failure cases. On-line switching
of failed processors.

PHYSICAL

Chip Count:    1980 Projection: 70 (55 16k-bit mem + 15 control)
               If use 1977 parts: 250 (100K ECL, etc.)
               (220 4k-bit mem + 30 control)

Physical Size: 1980: Part of processor assy.
               1977: Part of processor assy.

Power Drain:   1980:              1977:

Figure 2-7. PPM Logic (diagram labels: from EU; to instruction register;
to/from SECDED checker/generator and byte-serializer)



TABLE 2-3

PROCESSOR PROGRAM MEMORY CHARACTERISTICS

UNIT: Processor Program Memory (PPM)    No. In System: 512 + 4 spares with spare processor

FUNCTIONAL CHARACTERISTICS

Function: Contains program for the processor. Is loaded using the BDCST path from
    the CU.

Source of Control; During User Program: Processor's program counter.
    During System Startup and Diagnostics: Same

Storages; Capacity: 8,192 words
    Speed: 120 ns

Connectivity to Other Elements:

#   Path      To or From    No. Sig.   Timing           Primary Use
1   program   To/from EU    55         static           Fetch and load program
2   address   From EU       16         static           Address
3   control   From EU       2          edge or static   Command

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED
Repair Method: Remove with entire processor. Not a separate entity.
MTBF of Unit: See Chapter 5
Degraded Modes Available: Programs compiled to less than 512 processors bypass failed PPM's.
    Error correction allows program to continue at reduced reliability, in single-bit
    failure cases. On-line switching of failed processors.

PHYSICAL

                 1980 Projection            If use 1977 parts (100K ECL, etc.)
Chip Count:      43 (28 mem + 15 control)   140 (110 mem + 30 control)
Physical Size:   Part of processor assy.    Part of processor assy.
Power Drain:
The control unit can also be controlled by commands from the host
computer issued via the Diagnostic Controller (DC). This mode of
operation is supplied for the purpose of performing diagnostics.

The control unit is at once the most complex, in terms of variety 
of functions performed, and the most pedestrian, in terms of the 
demands it makes on the logic designer, of all the units in the 
FMP. Such hand analysis as has been done indicates that for the 
aerodynamic flow problems, the control unit will most of the time 
be waiting on the processors. One of the aims of the simulation 
is to find out if this statement is really true, or whether an 
investment in a faster control unit will pay off. 

The frequency with which the CU executes system software upon 
interrupt, in the middle of user executions, will affect the 
required speed of the CU. The present plan is to so allocate the 
tasks in the system that during normal executions no interrupts 
either from host or resulting from FMP code are expected. 

The host initiates file-system-to-DBM transfers using its copy of 
the DBM allocation map and issuing I/O commands directly to the 
DBM controller. No FMP-resident routine is involved in the 
initiation or completion of these transfers. The DBM controller 
resolves any potential conflict between these host transfers and a 
CU-initiated DBM-EM transfer. 

Figure 2-8 is the block diagram of a control unit built around a
single bus for transferring all data to and from memory, and using 
this same bus for one of the register file outputs. Such a 
structure defeats overlap but simplifies design. If simulation 
were to show that a faster CU is needed, a faster CU would be 
built. 
Figure 2-8. CU Block Diagram
In addition to the portion shown in Figure 2-8, the control unit
also contains a section which resolves conflicts for EM between
the instructions of the NSS and the needs of the DBM controller.

The control unit has four semi-independent execution stations, 
just as the processor has three. The degree to which the 
execution of the independent sections is to be overlapped is a 
subject for study during simulations in future work. Using the
two aerodynamic flow models as benchmarks tells us that no overlap
is required; therefore, specifying an exact mechanism of overlap
has been deferred. The four units are:

* Integer Unit 

* Memory Control 

* Floating Point Unit (optional; can be omitted if it is
determined that so-called scalar processor capability is
not required for the contemplated applications. See
Section 6.5)

* Interface to host and DBM controller 

Instruction timing is given in the next section, 2.5. Table 2-4 
lists the features of the CU. 

2.4.5.2 Scalar Processor


Floating point scalars are an item of concern in some applica- 
tions. In the baseline system, an optional design feature to 
handle floating-point scalars is a floating-point arithmetic 
capability in the control unit. For a discussion of other options 
for attaching scalar capability to the FMP, see section 6.16. 
Scalar floating point capability is not to be confused with the
"scalar unit" found in some other designs. The addressing and
control functions of such a "scalar unit" are included in the
control unit here whether or not the floating-point option is
included.
TABLE 2-4

CONTROL UNIT CHARACTERISTICS

UNIT: Control Unit (CU)                 No. In System: 1

FUNCTIONAL CHARACTERISTICS

Function: Executes the non-array portion of the FMP program. Executes the FMP-resident
    portion of the system software.

Source of Control; During User Program: Program stream contained in Control Unit Memory
    During System Startup and Diagnostics: Same plus commands issued from Diagnostic
    Controller

Storages; Capacity: Integer register file, perhaps 16 words, exact number to be determined
    by simulation. Floating point register file of 16 words.
    Speed: Single-clock access to two registers per file. 40 ns clock.

Connectivity to Other Elements:

#    Path      To or From            No. Sig.   Timing      Primary Use
1    control   To DBM Controller     TBD        TBD         Control of DBM-EM transfers
2    return    From DBM Controller   TBD        TBD         Completion, error, EM conflict resolution
3    control   To EM                 TBD        TBD         Control of EM
4    return    From EM               TBD        TBD         Monitoring, errors, interrupt
5    control   To TN                 13         TBD         Control of TN
6    STORCU    To TN                 9          byte/20ns   Data to be stored in EM
7    LOADCU    From TN               9          byte/20ns   Data fetched from EM to CU
8    command   To Processor          4          TBD         Diagnostic commands to the processor
9    sync      To Processor          4          edge        Synchronization of array
10   sync      From Processor        4          edge        Synchronization of array
11   BDCST     To Processor          8          byte/20ns   Broadcast data
12   HVST      To Processor          8          byte/20ns   Data (such as global max) to CU

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: TBD
Repair Method: TBD. Repair in place; FMP is down until CU repaired.
MTBF of Unit: See Chapter 5
Degraded Modes Available: None.

PHYSICAL

                 1980 Projection                    If use 1977 parts (100K ECL, etc.)
Chip Count:      3,000 chips (a coarse estimate)    4,000 chips
Physical Size:
Power Drain:
The FORTRAN language and compiler of Chapter 3 make no use of the
floating-point option in the CU, as there was no use for it in the 
four codes used for benchmarking. 

2.4.6 Control Unit Memory (CUM) 

The control unit memory holds both program and data for the 
control unit. It is addressable only from the control unit, and
sends all data into the central data bus of the control unit. 

The control unit memory is identical in electrical design and uses 
the same 16k-bit RAM chips as the processor memories. Its size is
subject to verification via simulation. The size resulting from 
considerations of the flow-model matching study is 32,768 words. 

The control unit memory is initially loaded from DBM at the 
beginning of each run using a routine which is itself resident in 
CUM and executes on the CU. The routine transfers data and 
program from DBM to CUM via EM. 

Data on the control unit memory is found in Table 2-5. 

2.4.7 Extended Memory Module 

Extended memory (EM) is the "main" memory of the FMP, in that it
holds the data base for the program during program execution.
Temporary variables, or work space, can be held in either EM or
PDM, as appropriate to the problem. All I/O to and from the FMP
is to and from EM via DBM. Control of the EM is from two sources:
the first is instructions executed in the CU; the second is the
DBM controller, which handles the DBM-EM transfers. In the
baseline system design, the DBM-EM rate is such that the CU can be
given first priority into EM without losing any of the DBM-EM
transfers; therefore, the CU instructions have priority in the EM.





TABLE 2-5

CHARACTERISTICS OF CONTROL UNIT MEMORY

UNIT: Control Unit Memory (CUM)         No. In System: 1

FUNCTIONAL CHARACTERISTICS

Function: Contains data local to the CU, and CU's program. Also contains processor
    program as source for overlay during runs. Holds mailbox for host-FMP communication.
    Holds copy of DBM allocation map.

Source of Control; During User Program: CU
    During System Startup and Diagnostics: Same plus may be accessed by DC if CU not running

Storages; Capacity: 32,768 words.
    Speed: 120 ns cycle

Connectivity to Other Elements:

#   Path      To or From    No. Sig.   Timing           Primary Use
1   data      To/from CU    55         static           Fetch and store data
2   address   From CU       16         static           Address
3   command   From CU       2          edge or static   Command

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED
Repair Method: FMP is down while CUM is down. Must replace failed modules for FMP to
    recover.
MTBF of Unit: Dominated by control logic because of SECDED
Degraded Modes Available: Error correction allows program to continue at reduced
    reliability, in single-bit failure cases.

PHYSICAL

                 1980 Projection                       If use 1977 parts (100K ECL, etc.)
Chip Count:      175 chips (110 mem + 15 control)      470 (440 mem + 30 control)
Physical Size:   TBD                                   TBD
Power Drain:
EM consists of 521 identical modules, which are accessed in
parallel. The module count, 521, is chosen to be a prime number
so that vectors of any length can be fetched efficiently in
parallel (with the minor exception of any vectors that happen to
have elements spaced apart in memory by exactly 521).
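The benefit of a prime module count can be illustrated with a short sketch. The address-to-module mapping used here (module = address mod 521) and the function names are assumptions made only for this example; the actual mapping is the one described in Ref. 1.

```python
NUM_MODULES = 521  # prime module count, as stated in the text


def modules_touched(start, stride, count=512, num_modules=NUM_MODULES):
    """Module index hit by each of `count` vector elements.

    Assumes element k sits at address start + k*stride and that the
    module index is address mod num_modules (illustrative mapping only).
    """
    return [(start + k * stride) % num_modules for k in range(count)]


def is_conflict_free(start, stride, count=512, num_modules=NUM_MODULES):
    """True when no two of the elements fall in the same module."""
    hits = modules_touched(start, stride, count, num_modules)
    return len(set(hits)) == len(hits)


if __name__ == "__main__":
    for stride in (1, 2, 64, 512, 521):
        print(stride, is_conflict_free(0, stride))
```

Under this assumed mapping, every stride that is not a multiple of 521 spreads 512 elements over 512 distinct modules, whereas a power-of-two module count would collide on even strides.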

From each EM module we need a transfer rate and access time 
consistent with the most economical implementation. For the 
baseline system, an implementation in 64k-bit dynamic RAM was 
chosen, as being the most economical implementation available by 
1980. The low chip count also enhances reliability. Projections
say that a 64k-bit chip will have a 250 ns cycle time by that
date. The 280 ns cycle time of the memory is compatible with the
140 ns per word transfer rate through the transposition network.
Each word carries a single-error-correction, double-error-detection
(SECDED) code, which is generated at the source (DBM, CU, or
processor) and also checked there, so that transfer paths are
covered by the same error control as the contents of EM.
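The 140 ns per word figure is consistent with the byte-serial paths listed in the tables. The sketch below assumes that each 9-signal path carries 8 data bits plus one strobe and that a stored word is 55 bits wide; both are readings of the tables rather than statements made here.

```python
import math

WORD_BITS = 55           # width of the memory data paths in the tables
DATA_BITS_PER_BYTE = 8   # assumption: 9 signals = 8 data bits + 1 strobe
BYTE_TIME_NS = 20        # "byte/20ns" timing entry in the tables
EM_CYCLE_NS = 280        # EM cycle time quoted in the text


def word_transfer_ns(word_bits=WORD_BITS,
                     bits_per_byte=DATA_BITS_PER_BYTE,
                     byte_ns=BYTE_TIME_NS):
    """Time to move one word over a byte-serial path."""
    bytes_per_word = math.ceil(word_bits / bits_per_byte)
    return bytes_per_word * byte_ns


if __name__ == "__main__":
    t = word_transfer_ns()
    print("ns per word:", t)
    print("words moved per EM cycle:", EM_CYCLE_NS // t)
```

With these assumptions a word needs seven byte times, so the network moves two words in the time of one EM cycle.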

Having decided on a TN that is almost twice as fast as the EM 
module, it would be possible to build the EM module in two 
interlaced submodules, if the streaming mode of fetching were
to see much use. Section 6.10 discusses the tradeoff between 
implementing or not implementing this streaming mode of access. 

The baseline system as described in this document avoids the com- 
plexities of a design suitable for streaming, which includes among 
other things, a capability of incrementing the address in the EM 
module by nonunity increments. The chip count of Table 2-6 does
not include any incrementer. 
Figure 2-9. EM Module
TABLE 2-6

EM MODULE CHARACTERISTICS

UNIT: EM Module                         No. In System: 521

FUNCTIONAL CHARACTERISTICS

Function: Stores problem data base during program executions. Most nearly corresponds
    to "core" of conventional processor.

Source of Control; During User Program: Receives commands from CU
    During System Startup and Diagnostics: Same

Storages; Capacity: 65,536 words
    Speed: Access time 200-250 ns, interlaced for 140 ns/word block transfer

Connectivity to Other Elements:

#   Path      To or From       No. Sig.   Timing                Primary Use
1   LOADEM    To TN            9          byte/20ns             Fetching data to processors and CU
2   STOREM    From TN          9          byte/20ns             Storing data from processors and CU
3             To DBM           9          full word in 400 ns   Results back to DBM
4             From DBM         9          full word in 400 ns   Initial data (and eventually, overlay) from DBM
5   No        From backplane   10         D.C. level            Module's own number
6   Control   From CU          TBD        TBD                   Controls EM operations

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED (providing acceptable error rates are demonstrated)
Repair Method: Remove and replace
MTBF of Unit: Control dominates failure modes because of SECDED.
Degraded Modes Available: Data continues to be corrected even when there is one hard
    error, allowing the current program to complete before repairs are undertaken.

PHYSICAL

                 1980 Projection                 If use 1977 parts (100K ECL, etc.)
Chip Count:      86 (55 memory + 30 control)     274 (224 mem + 50 control)
Physical Size:   One medium sized p.c. board
Power Drain:
Figure 2-9 shows the EM module, including two address registers, a
one-word buffer for DBM transfers, and an access path to the EM
module's own number, wired into the backplane. Table 2-6 gives the
data on the EM module.

2.4.8 Fanout Tree 

A series of fanout boards is supplied to provide the CU to 
processor connection. From CU to processor, signals fan out to a
final 512 destinations. From the processors, the signals are 
combined, so that, within the CU, a single result appears in 
response to 512 signals emitted by the processors. For example, 
the "all processors ready" signal becomes true at the clock that 
the last enabled processor emits "I got here". Another such 
signal is the 512-input OR of "enabled". 
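The combining direction of the tree can be pictured with a small sketch. The 8-way fan-in per stage and the function names are assumptions for illustration; only the two combined signals ("all processors ready" and the 512-input OR of "enabled") come from the text.

```python
def reduce_tree(values, combine, fan_in=8):
    """Combine inputs stage by stage, `fan_in` inputs per combining node."""
    level = list(values)
    while len(level) > 1:
        level = [combine(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


def all_ready(enabled, arrived):
    """True once every enabled processor has emitted "I got here".

    A disabled processor is treated as ready so that it cannot hold
    up the array (an assumption consistent with "last enabled processor").
    """
    per_processor = [(not e) or a for e, a in zip(enabled, arrived)]
    return reduce_tree(per_processor, all)


def any_enabled(enabled):
    """512-input OR of the per-processor "enabled" signal."""
    return reduce_tree(enabled, any)


if __name__ == "__main__":
    enabled = [True] * 512
    arrived = [True] * 511 + [False]
    print(all_ready(enabled, arrived))   # last processor has not arrived
    arrived[-1] = True
    print(all_ready(enabled, arrived))
```

With 512 inputs and a fan-in of eight, three combining stages (512 to 64 to 8 to 1) suffice.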

At the processor, some signals are wired per-processor directly to 
the last level of fanout board; others are daisy-chained to eight 
processors from a single signal pin on the last board. The fanout 
boards are pin-limited. Simple buffers with one input pin and one 
output pin per signal dominate the circuit count, so hex buffers, 
easily available today, will not be improved upon by 1979-1980. 

Data on the fanout tree is in Table 2-7. The figure demonstrating 
the fanout tree is Figure 2-10. 

2.4.9 Transposition Network 

The transposition network allows the fully parallel, 512-wide, 
fetching of sets of variables that are to be processed in 
parallel. Up to 512 elements in one-dimensional vectors of any 
type can be fetched at full speed in parallel. When DOALL loops 
have two index variables, two-dimensional subsets of 
multidimensional arrays can also be fetched in parallel. For
details, see Ref 1, and Chapter Three. 
TABLE 2-7

FANOUT TREE CHARACTERISTICS

UNIT: Fanout Tree, CU to Processors     No. In System: 1

FUNCTIONAL CHARACTERISTICS

Function: Provides fanout for signals from CU to the 512 processors; accepts signals from
    the 512 processors and combines them appropriately for the CU. Consists of 36 boards.

Source of Control; During User Program: No control; all passive logic.
    During System Startup and Diagnostics: Same

Connectivity to Other Elements:

#    Path      To or From       No. Sig.   Timing      Primary Use
1    command   From CU          4          TBD         Diagnostic
2    sync      From CU          4          edge        Synchronization of array
3    sync      To CU            4          edge        Synchronization of array
4    BDCST     From CU          8          byte/20ns   Broadcast data
5    HVST      To CU            8          byte/20ns   Data to CU (such as global MAX)
6    command   To proc. 8's     4(x 64)    TBD         Diagnostic
7    sync      To proc. 8's     4(x 64)    edge        Synchronization of array
8    sync      From proc.       4(x 512)   edge        Synchronization of array
9    BDCST     To proc. 8's     8(x 64)    byte/20ns   Broadcast data
10   HVST      From proc. 8's   8(x 64)    byte/20ns   512-input OR of data from processor to CU;
                                                       1st 8-way OR done on proc. wiring

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED on broadcast and harvest data.
Repair Method: Remove and replace defective boards.
MTBF of Unit: See Chapter 5.
Degraded Modes Available: None

PHYSICAL

                 1980 Projection                              If use 1977 parts (100K ECL, etc.)
Chip Count:      2,000 chips, all small scale integration.    2,000 chips
                 Dominated by 1,504 hex buffers.
Physical Size:   32 cards of 60-80 chips each                 Same
Power Drain:     1.6 kW                                       Same
The transposition network consists of 521 switchable data paths
from EM to processor, and another 521 data paths from processor to
EM. There are two 10-bit control registers, one for offset of the
starting element, and one for skip distance. Since there are two
sets of data paths, the first from processor to EM module and the
second from EM module to processor, the settings of the two paths
could be separately controlled. There is just one instruction
that would go faster if both paths were used simultaneously with
different settings, namely SHIFTN (see Tables 2-10 and 2-11 for a
description). SHIFTN is used in functions that operate
"horizontally" across the parallelism of the array, such as global
sum, global maximum, or global product. SHIFTN would also be used
to implement a fast Fourier transform on the FMP. In the aero
codes used as benchmarks, there is very little use of SHIFTN, so
there is no justification for having separate settings for the
first and second data paths, and bidirectional data paths would
serve as well.
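One way a "horizontal" function such as global sum might be built from shifts is recursive doubling. The sketch below is only an illustration: `shiftn` is a hypothetical stand-in that moves each processor's value a fixed distance across the array, and it does not model the actual SHIFTN instruction of Tables 2-10 and 2-11 or the passage of data through EM.

```python
def shiftn(values, n):
    """Hypothetical shift: processor i receives the value of processor (i - n) mod P."""
    p = len(values)
    return [values[(i - n) % p] for i in range(p)]


def global_sum(values):
    """Recursive-doubling sum; every processor ends up holding the total.

    Assumes the number of processors is a power of two (for example 512).
    Returns the per-processor results and the number of shift steps used.
    """
    p = len(values)
    acc = list(values)
    dist, steps = 1, 0
    while dist < p:
        shifted = shiftn(acc, dist)
        acc = [a + b for a, b in zip(acc, shifted)]
        dist *= 2
        steps += 1
    return acc, steps


if __name__ == "__main__":
    result, steps = global_sum(list(range(512)))
    print(result[0], steps)
```

For 512 processors this scheme needs nine shift-and-add steps, which suggests why such functions lean on the shift path far more heavily than the benchmark aero codes do.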

A three-bit command register enables the following commands: 

1. Enable transfers between processor and EM. The presence 
or absence of actual transfer is signified by the presence or 
absence of a signal on the strobe line that accompanies each 
byte-wide signal path. 

2. Enable transfers between CU port and EM. 

3. Enable transfers between the remaining eight paths and EM
(built into the design to allow these eight ports to service
the scalar processor).

4. Broadcast from selected EM module to all processors. 

Table 2-8 gives the characteristics of the transposition network. 
Figure 2-11 shows the barrel switches that implement it. 
Figure 2-11. Transposition Network (diagram labels: from EM modules (LOADEM, LOADCU); to EM modules (STOREM, STORCU); connectivity scramble (see Ref. 1); inverse of connectivity scramble (see Ref. 1))
TABLE 2-8

TRANSPOSITION NETWORK CHARACTERISTICS

UNIT: Transposition Network (TN)        No. In System: 1

FUNCTIONAL CHARACTERISTICS

Function: Provides 521 data paths for fetching in parallel from all EM modules to all
    processors; provides 521 data paths for storing in parallel from all processors to
    512 EM modules. Provides path from any one EM module to all processors. Provides
    data path to any EM module from CU, also path from any EM module to CU.

Source of Control; During User Program: Commands from CU.
    During System Startup and Diagnostics: Same

Storages; Capacity: None. Command register 10 bits offset, 10 bits skip distance, about
    3 bits of command.
    Speed:

Connectivity to Other Elements:

#   Path      To or From        No. Sig.   Timing      Primary Use
1   LOADEM    To Processor      9(x 512)   byte/20ns   Data to processor during LOADEM
2   STOREM    From Processor    9(x 512)   byte/20ns   EM addresses and STOREM data from proc.
3   LOADCU    To CU             9          byte/20ns   Data to CU during LOADCU
4   STORCU    From CU           9          byte/20ns   Data and address from CU
5   -         To EM modules     9(x 521)   byte/20ns   Data and address to EM modules
6   -         From EM modules   9(x 521)   byte/20ns   Data from EM modules
7   control   From CU           13         TBD         Reset controls
8   spare     To TBD            9(x 8)     byte/20ns   Reserved for scalar processor
9   spare     From TBD          9(x 8)     byte/20ns   Reserved for scalar processor

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: SECDED applied to EM word passes through TN. Detects hard
    failures, corrects transients.
Repair Method: TBD
MTBF of Unit: See Chapter 5.
Degraded Modes Available: Some portion of the TN can be bypassed by programs that are
    compiled for a less-than-full complement of processors. Most, however, cannot.

PHYSICAL

                 1980 Projection                      If use 1977 parts (100K ECL, etc.)
Chip Count:      10,980 (10,480 shifter chips         17,270 (16,770 F100158 chips
                 + 500 control)                       + 500 control)
Physical Size:   About 200 boards if 200 signals      Same
                 allowed per board. Is pin limited.
Power Drain:
2.4.10 Data Base Memory (DBM)


Data Base Memory (DBM) is the window in the computational envelope
of the FMP. All jobs to be run on the FMP are staged into DBM
before running, both program and data; all output from the FMP is
staged through the DBM. At some future time (but not with the
initial operating system) DBM could be used to back up EM for
those problems whose data base is larger than EM. Control of the
data base memory is from a DBM controller, which accepts commands
both from the CU for transfers between DBM and EM, and from the
host for transfers between DBM and the file system.

Many design options exist for the data base memory. Out of this 
set of options one particular design was chosen for the baseline 
system. This chosen design is a CCD memory built out of 
256k-bit chips, which are projected to be available in the 1980
period. If data base memory were to be built before the 
appearance of sufficiently economical CCD chips, one would use 
some form of parallel-head rotating magnetic storage. The design 
described here is based on the existence of 256k-bit CCD chips 
each arranged in the form of 128 shift registers of 2,048 bits 
each. 

With a projected shift rate of 2.5 MHz in the CCD chips, and a desired
transfer rate of 2.5 Mwd/s to and from EM, DBM is built 55 chips
wide, for parallel emission of 55-bit words, by 512 chips deep.

The natural block size is adopted: with 2,048 bits in each shift
register, each access delivers a block of 2,048 words. There are 64k
blocks for a total of 134,217,728 words. Error correction is
SECDED, probably the modified Hamming-plus-parity code implemented by
Motorola's 10163 chip.
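The capacity figures above follow directly from the chip organization, as the short calculation below shows. The variable names are illustrative; the numbers are those quoted in the text.

```python
SHIFT_RATE_HZ = 2.5e6   # projected CCD shift rate
REGS_PER_CHIP = 128     # shift registers per 256k-bit chip
BITS_PER_REG = 2048     # bits per shift register
CHIPS_WIDE = 55         # one chip per bit of the 55-bit word
CHIPS_DEEP = 512

blocks = CHIPS_DEEP * REGS_PER_CHIP            # one block per register row
words_per_block = BITS_PER_REG                 # one word per shift
total_words = blocks * words_per_block
memory_chips = CHIPS_WIDE * CHIPS_DEEP
block_time_ms = words_per_block / SHIFT_RATE_HZ * 1e3

if __name__ == "__main__":
    print("blocks:", blocks)
    print("total words:", total_words)
    print("memory chips:", memory_chips)
    print("block transfer time (ms):", block_time_ms)
```

The result agrees with the 64k blocks and 134,217,728 words stated here, and with the 28,160 memory chips and the 0.82 ms block transfer time that appear elsewhere in this section.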





Since the array of CCD chips is 512 x 55, the DBM is constructed
in a number of physical modules, say each one 64 x 55 chips. The
repair philosophy is to pull and replace individual modules. The
degraded mode of operation would be to run with one or more
modules missing; the operating system would have to know to
avoid assigning any data to that space.

There are several (probably four) block-sized buffers, which stand 
between the CCD storage and the host interface, in order to reduce 
the interference with DBM-EM transfers produced by simultaneous 
DBM-host transfers. They can also serve as timing buffers to the
host’s disk packs. See Fig. 2-12. 

After the transfer of a block to or from the CCD store, the shift 
registers rest at the starting position until shifting is required 
by the refresh requirements, or until the CCD store is again 
addressed, whichever occurs first. Therefore, whenever there are 
several requests for transfer pending at once, or when they occur
with sufficient frequency, the access time is essentially zero to 
the first word of the block. For transfers arriving at random 
times, far enough apart in time so as not to interfere, the 
average access time is given by: 

    T_av = (1/2) (T_b^2 / T_r)

where T_b is the transfer time of a single block (0.82 ms) and T_r
is the time between refreshes. T_r will be in the specification of
the device, and is expected to lie between 1 ms and 10 ms. Therefore,
the average access time for random data at low usage, to the
first word of the block, has an upper bound which is expected to
lie between 0.67 ms and 0.067 ms. As traffic increases, the
access time is mostly due to interference between competing 
accesses, while the contribution due to delay in the memory goes 
to zero. 
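The quoted bounds can be reproduced numerically. The sketch below evaluates the square of the block transfer time divided by the refresh interval, on the reasoning (an interpretation, not stated in the text) that a random request meets a refresh in progress with probability equal to the ratio of the two times and then waits at most one block time.

```python
WORDS_PER_BLOCK = 2048
SHIFT_RATE_HZ = 2.5e6

# Block transfer time in milliseconds (about 0.82 ms).
T_BLOCK_MS = WORDS_PER_BLOCK / SHIFT_RATE_HZ * 1e3


def access_bound_ms(t_block_ms, t_refresh_ms):
    """Upper bound on mean access time at low usage.

    Probability of colliding with a refresh (t_block / t_refresh)
    times the worst-case wait (t_block).
    """
    return t_block_ms ** 2 / t_refresh_ms


if __name__ == "__main__":
    for t_refresh in (1.0, 10.0):
        print(t_refresh, "ms between refreshes ->",
              round(access_bound_ms(T_BLOCK_MS, t_refresh), 3), "ms")
```

Evaluated at refresh intervals of 1 ms and 10 ms, this gives the 0.67 ms and 0.067 ms limits cited above.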
Figure 2-12. DBM Block Diagram
TABLE 2-9

DATA BASE MEMORY CHARACTERISTICS

UNIT: Data Base Memory (DBM) and its controller     No. In System: 1

FUNCTIONAL CHARACTERISTICS

Function: In this memory, data is staged for FMP jobs not yet started, and results of FMP
    jobs are output from the FMP. Almost all communication between FMP and host goes through
    this memory, both data and program. CCD storage is postulated, although other options
    are available, including disk pack. Resolves host-CU conflicts.

Source of Control; During User Program: DBM-EM transfers controlled from CU, DBM-host
    transfers controlled from host.
    During System Startup and Diagnostics: Same

Storages; Capacity: 134 x 10^6 words in blocks
    Speed: 140 Mb/s (an easily adjustable parameter)

Connectivity to Other Elements:

#   Path      To or From     No. Sig.            Timing                          Primary Use
1             To/from EM     8+8                 words/40 ns                     Loads EM at start of run, unloads results
2             To/from host   TBD, 2 paths min.   rate matches host file system   Loading DBM, unloading results
3   control   From CU        TBD                 TBD                             Receives control from CU for DBM-EM transfers
4   result    To CU          TBD                 TBD
5   control   From host      TBD                 TBD                             Receives control from host for DBM-file-system transfers
6   result    To host        TBD                 TBD                             Monitoring and error cases

RELIABILITY/REPAIRABILITY/TRUSTWORTHINESS

Error Control Methods: TBD. SECDED may be adequate, and will be used if so. "Scrubbing"
    errors arising due to refresh will be needed in CCD memories.
Repair Method: TBD.
MTBF of Unit: Dominated by controls since SECDED on memory.
Degraded Modes Available: Error correction codes allow valid data to be fetched in spite
    of errors in memory. Can operate with failed modules removed.

PHYSICAL

                 1980 Projection                          If use 1977 parts (100K ECL, etc.)
Chip Count:      29,160 (28,160 mem + 1,000 control)      use disk pack
Physical Size:   about 150 large boards                   eight disk pack drives
Power Drain:
As a background job, the DBM controller periodically initiates an
access for the purpose of reading the contents of a block and
rewriting that same block with all detectable errors corrected, 
since errors are spontaneously created in CCD memories at a low 
rate during the refresh operation. It has been conjectured that 
these errors are caused by cosmic ray bombardment of the CCD 
chips, discharging the little capacitors by temporarily ionizing 
the oxide. The rate of periodically initiating access can 
rationally be determined only after getting the vendor's speci- 
fication on the number of refreshes per error. Preliminary 
Fairchild data, if it continues to be true, indicates that one 
should scrub through the entire DBM every seven minutes, or that 
this background task should occur at one eighth the normal 
bandwidth of the DBM. Therefore, this background access is 
initiated every 6.55 ms. Only one error-scrubbing access will be 
pending at a time, even if the delay in starting exceeds 6.55 ms. 
They are not queued. 
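The scrubbing figures are mutually consistent, as a short check shows. The block count and block transfer time are taken from the DBM organization described earlier in this section; the variable names are illustrative only.

```python
BLOCKS = 65536             # 64k blocks in DBM
BLOCK_TIME_MS = 0.8192     # time to read or rewrite one block
SCRUB_INTERVAL_MS = 6.55   # one scrubbing access is initiated this often

# Time to visit every block once, in minutes.
full_pass_minutes = BLOCKS * SCRUB_INTERVAL_MS / 1000.0 / 60.0

# Share of DBM bandwidth consumed by scrubbing.
bandwidth_fraction = BLOCK_TIME_MS / SCRUB_INTERVAL_MS

if __name__ == "__main__":
    print("minutes per full scrub pass:", round(full_pass_minutes, 2))
    print("fraction of bandwidth:", round(bandwidth_fraction, 3))
```

One access every 6.55 ms visits all blocks in a little over seven minutes and occupies about one eighth of the DBM bandwidth, matching the two statements above.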

The DBM has a number of channels into the file system of the host. 
The number is to be determined by simulation. Initial estimates 
are that two channels provide more channel capacity than needed 
for the aerodynamic flow models. At least two are needed for 
reasons of reliability. Two are assumed for the baseline system 
design. 

No buffering is needed on the EM side beyond the one-word buffers 
in each EM module. The CU will guarantee that the acceptance by the EM
of a word coming from DBM takes less than 400 ns. Likewise, when
transferring from EM to DBM, the EM module has its one-word buffer 
loaded nominally 800 ns or more ahead of the DBM requirement, and 
this time will not slip by more than 400 ns from interference with 
array transfers. 
DBM-EM transfers have priority in the EM controls. However, there
is little interference with CU-initiated EM transfers. For 
example, when transferring from EM to DBM, one EM cycle loads 521 
of the per-EM-module one-word buffers, and then waits for 208 
microseconds before another EM cycle is required for the DBM 
transfer path. 
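The size of the interference can be estimated from figures given in this chapter: one word per 400 ns on the DBM path, 521 one-word buffers, and a 280 ns EM cycle. The calculation below is a rough sketch that assumes a single EM cycle is the only cost to the array per group of 521 words.

```python
MODULES = 521            # one-word buffers loaded by a single EM cycle
WORD_PERIOD_NS = 400     # DBM path moves one word every 400 ns
EM_CYCLE_NS = 280        # EM cycle time

# Time for DBM to drain all 521 buffers, in microseconds.
drain_time_us = MODULES * WORD_PERIOD_NS / 1000.0

# Fraction of EM cycles taken away from CU-initiated transfers.
interference_fraction = EM_CYCLE_NS / (MODULES * WORD_PERIOD_NS)

if __name__ == "__main__":
    print("microseconds between DBM-related EM cycles:", drain_time_us)
    print("fraction of EM time used:", round(interference_fraction, 5))
```

The drain time reproduces the 208 microseconds quoted above, and the EM time given up to DBM traffic comes out well under one percent.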

A design decision, to be made with the aid of simulation in phase 
II, is whether the LOADEM and STOREM instructions should be 
limited to 512 words per execution, or whether they should trans- 
fer 512 X N words at a time. The description given above is 
concordant with a design in which LOADEM and STOREM are 512-word 
instructions, which are the only use made of LOADEM and STOREM in 
the FORTRAN compiler described in Chapter Three. In Chapter Six 
the implications of this choice are discussed at further length. 

Use of DBM is as a staging area for jobs going into the FMP or 
coming out of the FMP. The hardware design also permits its use 
as a source for overlaying data and program into the FMP. It is 
possible to transfer less than a full block, but not to start any 
place other than the beginning of the block. A decision to make 
heavy use of the overlay capability would result in reevaluating 
the transfer rate between EM and DBM. 

2.5 INSTRUCTION SET AND INSTRUCTION TIMING 

This section lists the instruction set together with a list of 
numbers giving the execution times of each. 
2.5.1 Tables


There are three tables. Table 2-10 contains the instructions and 
timing for the processor, of which there are 512. Table 2-11 
contains instructions and timing for the control unit of the 
baseline system. Since no scalar unit is required for the 
aerodynamic equations, scalar unit timings are not specifiable on 
the basis of any known application. Rather arbitrarily, the 
floating-point instructions of table 2-12 are given the same 
timing as their processor counterparts. These instructions belong 
to the option for processing floating-point scalars in the control 
unit. 

Instruction formats are easy to specify, and have been postponed
until more difficult issues are resolved. See Section 6.5.

2.5.2 Instruction Execution - Timing 

For the processor instructions there are three separate functional
units involved. Each instruction either has a starting time and an
ending time in each of the three units or does not use that unit. The
time of execution of each instruction is dependent on its time of
occupancy (if any) in each of the independent execution units,
namely: integer unit, floating point unit, and memory controls.
The timing is described most easily with respect to the
instruction fetching process, which determines the starting time
of each successive instruction. A fourth functional unit, to allow
EM fetches and stores to transpire in parallel with other
processing, is under consideration, but has not been included in
this description.
Entries in the table have the following significance:

"No. of clock periods" is the number of clocks from when the 
instruction normally issues to a functional unit, to the termi- 
nation of the instruction. The instruction will always have been 
decoded from out of the staging register for at least one clock 
prior to this. 

"Unit busy" is of the form n-m, where n is the number of the 
latest clock through which the previous instruction is allowed to 
occupy this unit, and m is the last clock in which the current 
instruction occupies this unit. 

Some instructions merely stop the instruction fetching process for 
a while, until the control unit restarts it. The clock times 
given for these instructions represent the time from first 
decoding such an instruction in the staging register, until the 
start of decoding of the next instruction, under the most 
favorable circumstances. These instructions are in tables 2-10 
and 2-11, and are WAIT, STOP, and HELP. 

2.5.3 Instruction Fetch Timing 


Timing of the instruction fetching mechanism can be seen with 
respect to Figure 2-13. The next instruction is held in a 
staging register. From the staging register are decoded the 
start times required in the functional units if this instruction 
were to start at this clock, and the time it will occupy the 
holding register. From the integer, the floating point, and the 
memory control functional units are decoded the ending times 
associated with the currently executing instructions. The 
"scoreboard" compares all six times. When all four comparisons 
say the next instruction will not interfere with current 
instructions, the instruction is transferred from the staging 
register to one or more functional unit instruction registers. 
If delayed starts in other functional units are part of this 
instruction, the instruction is passed to the holding register to 
free the staging register for the next instruction. 

Figure 2-13. Instruction Fetching Mechanism 
(figure not reproduced; legible labels: "TRIGGER TO PPM", "TO DECODING") 
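
The scoreboard check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design: the class name, the unit labels, and the rule that at most one instruction issues per clock are hypothetical, and the holding register and instruction fetch delays are ignored. The "unit busy" pairs are copied from Table 2-10.

```python
from dataclasses import dataclass, field

UNITS = ("int", "flt", "mem")  # integer unit, floating point unit, memory controls


@dataclass
class Scoreboard:
    """Tracks the clock at which each functional unit becomes free."""
    free_at: dict = field(default_factory=lambda: {u: 0 for u in UNITS})

    def issue(self, busy, earliest):
        """Issue an instruction no sooner than clock `earliest`.

        `busy` maps a unit name to its (n, m) "unit busy" entry. The previous
        occupant may hold the unit through clock n after issue, and this
        instruction holds the unit through clock m after issue.
        """
        t = earliest
        for unit, (n, _m) in busy.items():
            t = max(t, self.free_at[unit] - n)
        for unit, (_n, m) in busy.items():
            self.free_at[unit] = t + m
        return t


# "Unit busy" entries copied from Table 2-10
ADD_MEM = {"int": (0, 1), "flt": (3, 9), "mem": (0, 3)}  # ADD, Case 3
IADD_REG = {"int": (0, 1)}                               # IADD, Case 1
MUL_REG = {"flt": (0, 9)}                                # MUL, Case 1

sb = Scoreboard()
issue_times = []
t = 0
for busy in (ADD_MEM, IADD_REG, MUL_REG, ADD_MEM):
    t = sb.issue(busy, earliest=t)
    issue_times.append(t)
    t += 1  # assumption: at most one instruction issues per clock

print(issue_times)
```

In this sketch the second ADD from memory issues three clocks before the MUL releases the floating point unit, because its "unit busy" entry for that unit is 3-9: the address work in the integer unit and the memory fetch proceed while the multiply finishes.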

The program counter always points to the next word in memory after 
the staging register contents. Thus, normally the PPM will be 
holding the next instruction word statically at its output lines. 
Only when the staging register is unloaded in less than three 
clocks (the PPM cycle) will the next word not appear. 

A complexity is the existence of half-word and full-word 
instructions. Empty halves of half-word instructions carry the 
first half of the next instruction, so full-word instructions may 
only have their first half present in the staging register. The 
first half is sufficient to determine the timing. However, the 
second half will contain any memory addresses, so when a fetch 
from memory is involved, the second half must also be fetched 
before the memory part of the operation can start. 

In the baseline system, those instructions which contain a memory 
address (either for data or as a branch address), or a literal, 
are full-word 48-bit instructions. Others are 24 bits. 

Jumps take an extra three clocks before the first instruction on 
the path branched-to can be started. 
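
The packing of half-word and full-word instructions described above can be illustrated with a short Python sketch. It assumes, as the text implies, that instructions are laid end to end in 24-bit halves of 48-bit words; the function name and the tuple layout are hypothetical.

```python
def pack(lengths):
    """Place 24-bit and 48-bit instructions back to back in 48-bit words.

    Returns one (word_index, half_index, straddles) tuple per instruction.
    `straddles` is True when a 48-bit instruction starts in the second half
    of a word, so that only its first half is present in that word.
    """
    placements = []
    half = 0  # running position, counted in 24-bit halves
    for bits in lengths:
        if bits not in (24, 48):
            raise ValueError("instructions are 24 or 48 bits long")
        word, pos = divmod(half, 2)
        straddles = bits == 48 and pos == 1
        placements.append((word, pos, straddles))
        half += bits // 24
    return placements


layout = pack([24, 48, 24, 24, 48])
print(layout)
```

The straddling cases are the ones in which the staging register holds only the first half of a full-word instruction, which is enough to determine timing but not to start the memory part of the operation.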

2.5.4 Example 

For an example of how this works, take the sequence of instruc- 
tions: 

1. FETCH from memory to integer register 

2. IADD reg. to reg. 

3. FETCH from memory to floating point register 

4. ADD from memory (indexed by integer reg.) to fl. pt. reg. 

5. ADD from mem. (indexed by integer reg.) to fl. pt. reg. 

6. MUL from fl. pt. reg. to fl. pt. reg. 

7. IADD int. reg. to reg. 

8. IADD int. reg. to reg. 

9. STORE from fl. pt. reg. to mem. (indexed by int. reg.) 

Figure 2-14 shows the timing diagram for this sequence, according 
to the rules described above. The instructions are identified by 
number in the figure. Each clock is 40 ns. 

The entire sequence of nine instructions takes 36 clocks, or 1,440 
ns. The sum of the "no. of clock periods" column in the timing table 
for these same instructions is 40 clocks. Overlap between 
functional units gained little in this example. It is expected to 
gain more in examples which place a higher emphasis on computing 
addresses in the integer unit. In the present example, the timing 
would have come out the same if the holding register had not been 
there and loading of the staging register were merely delayed. 
Simulation may tell us that the holding register gains nothing; 
that only the staging register is needed. Simulation during phase 
II will attempt to evaluate the gain given by the complexities 
here described. The final instruction fetching machinery will be 
the result of a tradeoff between simplicity and throughput. 
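
The arithmetic of this example can be checked with a short Python sketch. The clock counts are the "no. of clock periods" entries of Table 2-10 for the instructions whose rows are legible in this copy; the two FETCH rows are not, so they are left as None rather than guessed, and their combined count is derived from the 40-clock total stated above. The variable names are hypothetical.

```python
CLOCK_NS = 40  # clock period stated in the text

# Clock counts for instructions 1 to 9 of the example. None marks the two
# FETCH instructions, whose clock counts are not legible in Table 2-10.
clocks = [None, 1, None, 9, 9, 9, 1, 1, 3]

overlapped_clocks = 36  # elapsed time of the sequence, from the text
serial_clocks = 40      # sum of the per-instruction times, from the text

elapsed_ns = overlapped_clocks * CLOCK_NS
known = sum(c for c in clocks if c is not None)
fetch_clocks = serial_clocks - known          # left over for the two FETCHes
saving = 1 - overlapped_clocks / serial_clocks

print(elapsed_ns, known, fetch_clocks, round(saving, 2))
```

Under these figures the overlap between functional units saves 4 clocks out of 40, which is the "gained little" of the text.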

2.5.5 Control Unit Timing 


In the absence of a completely detailed design of the control 
unit, the internal structure and overlapping capabilities cannot 
be visualized with certainty. No overlap mechanism in the control 
unit is described in the table except for memory. Since there are 
four semi-independent instruction execution units, these times are 
pessimistic indeed. However, for aerodynamic flow problems used 
as benchmarks, the pessimistic assumption is expected not to 
matter. For aero flow problems, the interfering CU action will be 
address calculations, which will be a solid swatch of instructions 
all for the integer unit. Thus, we postpone designing the overlap 
and look-ahead capabilities within the CU until simulation in 
phase II tells us how much design effort we should spend on them. 

It is assumed that memory fetches and stores will be overlapped. 
Fetches can be initiated before the previous instruction is 
started. Fetch and store are three clocks each. The fetch of the 
next instruction must follow the store of this one, when fetch 
follows store in the instruction sequence. 

The diagnostic controller is not used during normal program 
running. It is used only for diagnostics and for system initiali- 
zation when power first comes on, or for reinitializing the PMP 
system software. 

Instruction fetching in the CU is overlapped with instruction 
execution, but is out of the same CUM that holds the CU data. The 
instruction execution unit will look ahead by an amount yet to be 
determined. 

The scalar processor is here implemented by adding floating-point 
capability to the control unit, and the entire repertoire of 
floating point processor type instructions is added to the control 
unit instruction set. See the discussion on "Scalar Processor" 
in Chapter 6. These instructions are: 

ADD, SUB, MUL, DIV, MAD, SSQ, ADDD, MULD, LT, LE, GT, 
GE, NEG, EQ, NE, INFL, FIX, FLOAT, INFZ, SETFL, SETZ, PAK2, 
ABS, UPF, and PENO (which yields either "0" or "512", to be 
determined) 

A scalar capability resident in the control unit may require a 
faster control unit than the one described in the accompanying 
timing tables. The degree of speedup of the design required is a 
matter to be determined by simulation. Parallel operation of 
semi-autonomous units (as seen in the processor) is one of the 
ploys used to achieve increased speed, together with fast multiply 
algorithms and other logic speedups. A method of achieving faster 
CU memory operation also may be required. Several memory modules, 
either interlaced or dedicated to concurrent and overlappable 
functions, could be included in such a design. The times shown 
here ignore these additional design options, since they will not 
be needed for aero flow benchmarks. 

2.5.6 Corresponding Times in Synchronizing Instructions 

An additional detail is the relative timing of instructions that 
must be synchronized between CU and processors. For these 
instructions, execution will proceed when all enabled processors 
and the CU have reached the instruction. For each instruction 
there is a "CU lead time", TL. The timing rules are as follows: 

The "go" pulse is emitted from the control unit a time Tc after 
the start of the instruction, if the "All processors ready" signal 
does not delay it. The "go" pulse is effective at the processors 
no sooner than a time Tp after the start of the instruction in the 
processor. Thus, if both CU and processor arrive at this 
instruction at the correct time so that both can execute it in the 
minimum time, there will be an offset of (Tp - Tc) clocks between 
these two initiations. For various cooperating pairs of 
synchronizing instructions, Table 2-13 gives TL (= Tp - Tc). 

Table 2-13 contains three columns. Column 1 is the CU name of the 
instruction. Column 2 is the processor name of the matching 
instruction. Column 3 is the CU lead time TL. Negative TL means 
that the CU can arrive at the instruction -TL clocks after the 
last processor without delaying the time of the instruction past 
its last-processor start time. TL values tend to be negative 
because the "same" clock pulse at the CU and the processors is 
actually about 60 ns sooner at the CU. That is, TL = 0 implies that 
the CU is 60 ns ahead of the processor. 
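
As a rough illustration of how the lead times might be used, the Python sketch below looks up TL for a processor instruction and reports how late the CU may arrive relative to the last processor. The TL values are those of Table 2-13; the function name is hypothetical, and the reading of a positive TL as "the CU must arrive that many clocks before the last processor" is an assumption drawn from the term "lead time", since the text only spells out the negative case.

```python
CLOCK_NS = 40     # clock period from section 2.5.4
CU_SKEW_NS = 60   # the "same" clock pulse is about 60 ns sooner at the CU

# CU lead time TL in clocks, keyed by processor instruction (Table 2-13)
T_L = {
    "HELP": -3, "LOADEM": 1, "STOREM": 1, "SHIFTN": 3,
    "EMNO": 1, "HVST": -3, "WAIT": -3, "STOP": -3,
}


def cu_latest_arrival(processor_instr):
    """Clocks after the last processor's arrival by which the CU may arrive
    without delaying the instruction. A negative result means the CU has to
    arrive that many clocks before the last processor (assumed reading)."""
    return -T_L[processor_instr]


skew_clocks = CU_SKEW_NS / CLOCK_NS
print(cu_latest_arrival("WAIT"), cu_latest_arrival("SHIFTN"), skew_clocks)
```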

2.5.7 Exceptional Cases 

Within the processor, all fault cases result in an interrupt to 
system software that is resident in the processor. It is possible 
to handle some interrupts without interrupting the CU. Floating- 
point out-of-range detection does not cause interrupts, but 
results in setting the floating-point variables into "infinity" or 
"infinitesimal". Any integer overflow causes an interrupt, on the 
theory that most integer operations are address calculations and 
overflow represents a faulty address. Attempting to insert a 
number outside the range ±(2^15 - 1) into a 16-bit integer register 
causes an integer interrupt; likewise executing a FIXD (double- 
length integer) on a number outside the range ±(2^31 - 1) results in 
an interrupt. Any detection of error in the error-detection- 
correction logic results in a processor interrupt. When the error 
is correctable, the interrupt merely logs its occurrence and 
returns to user processing. 
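
The integer range rule can be stated compactly in Python. This is a sketch only: the exception class and function name are hypothetical, and the symmetric limits follow the ranges quoted above for single-length and double-length integers.

```python
class IntegerInterrupt(Exception):
    """Stands in for the processor's integer overflow interrupt."""


SINGLE_MAX = 2**15 - 1  # limit for a 16-bit integer register
DOUBLE_MAX = 2**31 - 1  # limit for a double-length (32-bit) integer


def load_integer(value, double=False):
    """Return `value` if it fits the target register, else raise the interrupt."""
    limit = DOUBLE_MAX if double else SINGLE_MAX
    if abs(value) > limit:
        raise IntegerInterrupt(f"{value} is outside ±{limit}")
    return value


print(load_integer(32767), load_integer(32768, double=True))
```

Floating-point out-of-range results are not modelled here, since they set "infinity" or "infinitesimal" rather than interrupting.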

TABLE 2-10 

PROCESSOR INSTRUCTIONS 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

ADD, SUB*   Floating point add/subtract. Result to
            fl. pt. reg.
            Case 1. Reg. + Reg. to Reg.                 6       -      0-6    -      24
            Case 2. Reg. + Lit. to Reg.                 6       -      0-6    -      48
            Case 3. Reg. + Mem. to Reg.                 9       0-1    3-9    0-3    48

MUL*        Floating point multiply
            Case 1. Reg. x Reg. to Reg.                 9       -      0-9    -      24
            Case 2. Reg. x Lit. to Reg.                 9       -      0-9    -      48
            Case 3. Reg. x Mem. to Reg.                 12      0-1    3-12   0-3    48

DIV*        Floating point divide
            Case 1. Reg./Reg.                           44      -      0-44   -      24
            Case 2. Reg./Lit. to Reg.                   44      -      0-44   -      48
            Case 3. Reg./Mem. to Reg.                   47      0-1    3-47   0-3    48

DIVR        Same as DIV except the second operand
            is divided by the 1st.
            Case 1. 2d operand in reg. not implemented
            Case 2. Lit./Reg. to Reg.                   44      -      0-44   -      48
            Case 3. Mem./Reg. to Reg.                   47      0-1    3-47   0-3    48

MAD         Floating point add product of two operands
            to third operand. Result to same register
            in which third operand was found.
            Case 1. Reg. x Reg. + Reg. to Reg.          11      -      0-11   -      24
            Case 2. Reg. x Lit. + Reg. to Reg.          11      -      0-11   -      48
            Case 3. Reg. x Mem. + Reg. to Reg.          14      0-1    3-14   0-3    48

SSQ         Floating point sum of squares
            Case 1. Reg.^2 + Reg.^2 to Reg.             21      -      0-21   -      24
            Case 2. Mem.^2 + Reg.^2 to Reg.             24      0-1    3-24   0-3    48

ADDD, SUBD  Floating point sum (or difference) of two
            registers is kept in double length form
            and kept in two successive fl. pt. reg.
            The exponents of the two results differ
            by at least 38.                             13      -      0-13   -      24

MULD        Floating point multiply, with the full
            double length result put into two
            successive fl. pt. registers in the form
            of two normalized flt. pt. words with an
            exponent difference of 36 or more.
            Inputs are from registers.                  17      -      0-17   -      24

*If non-rounding versions of these instructions are supplied, the execution 
times will not differ from those given for the rounding version. 

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

FLIT        Transfer the 32-bit literal to the leading
            32 bits of the fl. pt. reg.                 2       -      0-2    -      48

IADD, ISUB  Integer add and subtract. Both input
            operands are from integer registers,
            result goes to a third register. One
            input may be literal.
            Case 1. Reg. ± Reg. or literal              1       0-1    -      -      24 (48 if lit.)
            Case 2. Reg. ± memory                       4       0-4    -      0-3    24

IADM, ISBM  Same as IADD, ISUB, except the first
            operand and result are double-length
            (from concatenation of int. reg. with
            next int. reg.)
            Case 1. 2d operand int. reg.                2       0-2    -      -      24
            Case 2. 2d operand lit.                     2       0-2    -      -      48
            Case 3. 2d operand from mem. (16 bits)      5       0-5    2-3    0-3    48

IADDD, ISBD Double-length integer add, one operand in
            two successive registers, second from two
            successive integer registers, result to
            two successive integer registers            2       0-4    -      -      24

IADDD, ISBD Second (32-bit) operand from memory         5       0-5    -      0-3    48

IMUL        Integer multiply
            Case 1. reg. x reg. or literal              9       0-9    -      -      24 (48 if lit.)
            Case 2. reg. x memory                       12      0-12   -      0-3    48

IDIV        Integer divide. Register or literal
            divided by register, result to register
            Case 1. reg./reg. or literal                16      0-16   -      -      24 (48 if lit.)
            Case 2. reg./memory                         19      0-19   -      0-3

IMLD        Multiply double-length integer in two
            successive registers by single-length
            integer, result to two successive
            registers                                   17      0-17   -      -      24 (48 if lit.)

IDVD        Divide double-length integer in one pair
            of registers by single-length integer.
            Result to single-length register            32      0-32   -      -      24

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

ID521       Divide double-length integer in register
            by 521, leave result in double-length
            register                                    13      0-13   -      -      24

IMOD        Saved remainder instead of quotient
            from IDIV                                   16      0-16   -      -      24 (48 if lit.)

ILIT        Transfer 16-bit literal to int. reg.        1       0-1    -      -      48

ILITT       Transfer 32-bit literal to double-length
            integer register formed by the
            concatenation of two single-length int.
            reg.                                        2       0-2    -      -      48

IALIT       Add the 32-bit literal to the designated
            double-length int. reg.                     2       0-2    -      -      48

SB          Set least significant bit of integer
            equal to the result of the preceding test
            (executed prior to the actual jump)         1       0-1    -      -      24

IADD1,ISUB1 Add (subtract) 1 from content of int. reg.  1       0-1    -      -      24

IMDD        Same as IDVD, except result is remainder
            not quotient                                32      0-32   -      -      24

ILT, ILE,   Test first integer register against
IGT, IGE,   second int. reg.; if true, branch to
IEQ, INE    location in branch address field.           2       0-2    -      -      48
            If fall thru:                               2       0-2    -      -      48
            If branch:                                  4       0-4    -      -      48

SHF         Shift index register right end-around by
            the number of places found in second
            register                                    2       0-2    -      -      24

LT, LE, GT, Test operand in first fl. pt. register for
GE, EQ, NE  compliance with expressed condition with
            respect to 2nd reg. New PCR address in
            address field.
            Fall-thru                                   2       -      0-2    -      48
            If jump                                     4       2-4    0-4    -      48

TIX         Test integer in one register against
            integer in second register, increment by
            content of third reg. Single length only.
            Fall-thru                                   2       0-2    -      -      48
            If jump                                     4       0-4    0-2    -      48

AND, OR     Logic combination of one integer register
            with another, result to a third             1       0-1    -      -      24

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

NOT         Complement of one integer register,
            result to a second                          1       0-1    -      -      24

BIT         If Nth bit of integer register is ONE, fall
            through, else jump to address contained in
            second index register. N is in register or
            literal.
            Fall-thru                                   2       0-2    -      -      24 (48 if lit.)
            If jump                                     4       0-4    2-4    -      24 (48 if lit.)

JUMP        Set program counter to value found in reg.  2       0-2    1-2    -      24

CALL        Subroutine entry. Involves automatic
            handling of stack of return information,    to be determined,
            and parameter passing.                      up to thirty clocks          48

RETURN      Subroutine return. Stack cut-back.          to be determined,
                                                        up to thirty clocks          24

INFY        Test fl. pt. reg. for equal to infinity
            Fall-thru                                   2       -      0-2    -      24
            If jump                                     4       2-4    0-4    -      24

INFL        Test fl. pt. reg. for infinitesimal
            Fall-thru                                   2       -      0-2    -      24
            If jump                                     4       2-4    0-4    -      24

POP         Execute stack action of RETURN, but do
            not change program counter setting          to be determined             48

TOS         Set stack pointer to new value, value
            found in register                           1       0-1    -      -      24

FIX         Convert operand found in fl. pt. reg.
            to integer. Result to integer register.     4       3-4    0-4    -      24

FLOAT       Convert operand in int. reg. to floating;
            result to fl. pt. reg.                      4       0-4    1-4    -      24

FIXD        Convert operand found in fl. pt. register
            to integer, result to two successive
            integer registers                           5       3-5    0-5    -      24

INFZ        Convert operand in fl. pt. reg. to zero if
            infinitesimal                               1       -      0-1    -      24

SETFL       Set infinitesimal control bit. Underflow
            will thereafter create infinitesimals       1       -      0-1    -      24

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

SETZ        Reset infinitesimal control bit. Underflow
            will thereafter create zeroes               1       -      0-1    -      24

PAK2        Take two floating point registers, round
            the value found in each to 24 bits length,
            concatenate the result, store in memory.
            The original operands are saved as long
            as the third register is distinct           9       6-7    0-6    6-9    48

PAKI        Take two integer registers, move one to the
            first half, and the other to the second
            half of a 48-bit word which is then
            stored in memory                            2       0-2    1-4    1-4    48

PAKID       Same, except that two pairs of integer
            registers hold 32-bit integers each, which
            are truncated (off left end) to 24-bit
            integers before packing                     4       0-4    2-7    4-7    48

PAKI3       Pack three 16-bit integer registers in a
            single word which is then stored to memory  5       0-5    2-8    5-8    48

UPI         Move the two 24-bit halves of a word
            fetched from memory to the pairs of
            registers indicated by the two integer
            reg. addresses                              5       3-5    2-4    0-3    48

UPI3        Move the three 16-bit fields of a word
            fetched from memory to the three int.
            registers addressed. Like PAKI3, may be
            used to keep an index value, its increment
            and its limit packed into a single memory
            word                                        6       3-6    2-5    0-3    48

UPF         Move the 24-bit halves of a word fetched
            from memory to the leading 24 bits of the
            two fl. pt. registers addressed, with zero
            fill                                        5       0-1    2-5    0-3    48

BDCST       Broadcast. Receive byte serial word from
            the CU and insert it into the processor.
            Timing varies with the destination.
            Case 1. Fl. pt. register                    7       7-8    4-7    -      24
            Case 2. Single int. register                8       7-9    4-7    -      24
            Case 3. Double (pair of) int. reg.          9       7-9    4-8    -      24
            Case 4. PEM                                 9       -      4-7    6-9    48

HVST        "Unbroadcast"; send word to the control
            unit. From fl. pt. register only.           7       -      4-7    -      24

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

FETCH       Move literal or register to register
            Case 1. Literal or fl. pt. reg. to
              fl. pt. reg.                                      -      0-1    -      24 (48 if lit.)
            Case 2. Literal or int. reg. to
              int. reg.                                         0-1    -      -      24 (48 if lit.)
            Case 3. Int. to fl. pt. or vice versa               0-1    0-1    -      24
            Case 4. Memory to fl. pt. reg.                      0-1    2-3    0-3    48
            Case 5. Memory to int. reg.                         0-3    -      0-3    48
            All integers above are 16-bit integers.
            For fetching to pairs of integer registers,
            fetching double-length integers, times
            are:
            Case 6. Flt. pt. to double integer reg's
              or vice versa                             2       0-2    0-2    -      24
            Case 7. Double int. to double int.          2       0-2    -      -      24
            Case 8. Memory to double int.               4       0-4    -      0-3    48

STORE       Store from source to PEM
            Case 1. Fl. pt. to memory                   3       0-1    0-3    0-3    48
            Case 2. 16-bit integer to memory            4       0-1    1-4    1-4    48
            Case 3. Double length (32-bit) int. to mem. 5       0-2    2-5    2-5    48

WAIT        Cease operations until CU emits "go".
            Takes one clock (at the instruction fetch
            unit) before transmitting the "I got
            here" signal. Takes three clocks for "I
            got here" to echo back from the CU as a
            new setting for the program counter, and
            5 clocks after that for the first
            instruction to get decoded. Takes only 4
            clocks if PCR not changed.                  9       -      -      -      24

STOP        Same as WAIT plus reset "enable". The 9
            clocks include the time to restart the
            program after starting but do not include
            any new setting of the program counter.     9       -      -      -      24

HELP        Same as STOP, plus sends interrupt to CU    9       -      -      -      24

PNO         Read processor no. from backplane into
            integer register                            1       0-1    -      -      24
            If processor is above the switched-out
            spare, add 1 to the number.                 2       0-2    -      -      24

In all of the following TN instructions, an option is that the execution may be 
conditional on an additional integer register's last bit. Thus, participation of 
a given processor in a LOADEM or STOREM need not use the much slower mechanism of 
executing STOP followed by a subsequent turn on. 

TABLE 2-10 (cont.) 

                                                        No.     Unit Busy
                                                        Clock          Flt'g         Instr.
Instr.      Description                                 Periods Int    Point  Mem    Length

LOADEM      Fetch 1 word from EM, address in pair
            of int. registers, to fl. pt. register.
            After first clock, test "ready" line
            from CU before continuing to count clocks   13      0-13   12-13  -      24

LOADEMM     Fetch N words from EM, address in pair of
            int. registers, to PEM. Test "CU ready"
            line as above. Memory cycles N times.
            Memory address found in int. reg.,          13+4N
            not in instruction (Note 1).                (Note 1) 0-13  -      13-(13+4N)  24

STOREM      Store 1 word from fl. pt. register to EM.
            EM address in double int. register.         5       0-2    1-5    -      24

STOREMM     Store N words from PEM to EM. PEM address
            is in integer register (Note 1).            5+4N    -      -      5-(5+4N)

SHIFTN      Transmit one word from fl. pt. register
            out onto TN after testing "CU ready"
            line. After transmission, test for a
            new turn-on of "CU ready", and receive
            from the line. The time given includes
            the 4 clocks the PE waits while the CU
            sets the TN to a new setting.               12      -      0-12   -      24

EMNO        Read EM module number into the processor.
            Wait for "CU ready", then transmit to int.
            register. Delays through the wire of the
            PE-to-CU-to-EM-to-PE path are included.     8       7-8    6-7    -      24

OFF         Test bit of int. reg.; if ZERO, halt and
            reset "enable" bit                          2       0-1    -      -      24

ABS         Make sign bit of fl. pt. reg. positive.
            Case 1. Operand in fl. pt. reg.             1       -      0-1    -      24
            Case 2. Operand from memory                 3       0-1    2-3    0-3    48

NEG         Change sign of fl. pt. reg.                 1       -      0-1    -      24

Note 1: These EM instructions, with a streaming of N words per instruction, are 
included to assist in evaluating the tradeoff between allowing such an N-word 
streaming of data, and restricting the EM instructions to 1 word each. A number 
of advantages accrue to the limitation to N=1. All of these instructions are 
implemented, but in the baseline design here presented we have limited the 
machine to N=1. A design option exists to implement other N up to some large 
limit. See Chapter Six. 
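
The streaming tradeoff mentioned in Note 1 can be made concrete with the processor clock counts from Table 2-10. The Python sketch below compares one N-word instruction against N single-word instructions. It is a rough comparison only: the function names are hypothetical, waits on the "CU ready" line are ignored, and the single-word and N-word forms do not have the same endpoints (LOADEM delivers to a fl. pt. register, LOADEMM to PEM).

```python
LOADEM_CLOCKS = 13  # LOADEM, one word (Table 2-10)
STOREM_CLOCKS = 5   # STOREM, one word (Table 2-10)


def loademm_clocks(n):
    """Clocks for one LOADEMM streaming n words: 13 + 4N."""
    return 13 + 4 * n


def storemm_clocks(n):
    """Clocks for one STOREMM streaming n words: 5 + 4N."""
    return 5 + 4 * n


def single_word_clocks(n, per_word):
    """Clocks for n back-to-back single-word instructions, assuming no overlap."""
    return n * per_word


for n in (1, 2, 8):
    print(n,
          loademm_clocks(n), single_word_clocks(n, LOADEM_CLOCKS),
          storemm_clocks(n), single_word_clocks(n, STOREM_CLOCKS))
```

On these figures alone, streaming loads begin to pay off at N = 2, while streaming stores cost more than repeated STOREM until the fixed 5-clock overhead is amortized at N = 5.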

TABLE 2-11 

CONTROL UNIT INSTRUCTIONS 

                                                        No. CU
Instr.      Description                                 Clocks   Memory   Instr. Length

CADD, CSUB  Add, subtract integers within the CU
            (32 bits)
            Case 1. Literal or reg. to reg.             1        -        24 (48 if lit.)
            Case 2. Memory to register                  1        Fetch    48

CDV521      Integer div. of register by 521, result
            to a second register                        9        -        24

CMD521      Similar to CDV521 except that original
            number MOD 521 is left in a third register. 10       -        24

CDVMD521    Produces both quotient and remainder
            for 521                                     11       -        24

CMD512      Save last 9 bits of one reg. in second reg. 1        -        24

CDV512      Shift right 9 places end-off into 2nd reg.  1        -        24

CMUL        Multiply two operands together
            Case 1. Literal or reg. by reg.             3+½N     -        24 (48 if lit.)
            Case 2. Memory by register                  3+½N     Fetch    48
            N is the bit position of the most
            significant ONE in the multiplier. Thus,
            multiplying by small positive integers
            is fast.

CDIV        Divide register by register or literal      5+N      -        24 (48 if lit.)
            Divide register by memory                   5+N      Fetch    48
            A preliminary shift, controlled by the
            number of leading zeroes in divisor and
            dividend, produces all or all but one of
            the zeroes in the quotient before the N
            successive subtractions.

CMOD        Save remainder from CDIV
            Case 1. Divisor from register               6+N      -        24
            Case 2. Divisor from memory                 6+N      Fetch    48

INT         Test bit n of interrupt register, reset it  10       -        24

MASK        Set/reset nth bit of mask register          10       -        24

TABLE 2-11 (cont.) 

                                                        No. CU
Instr.      Description                                 Clocks   Memory   Instr. Length

CIAD1,CISB1 Add (subtract) from register                1        -        24

CSHFD       Shift reg. by the shift distance
            (literal, or found in 2d reg.),
            end-off                                     1        -        24

CSHF        Shift end-around                            1        -        24

CSHFN       Shift numeric. If a right shift, fill the
            left with copies of the sign bit. If left,
            the shifted-off bits must all equal the
            retained sign bit, or integer overflow
            is declared.                                3        -        24

TIOM        Transmit content of two or three registers
            to DBM-EM controller                        2        Fetch    24

CFCH        Fetch from CU memory to register            1        Fetch    48

CSTR        Store to CUM from register                  1        Store    48

CTIX        Test index in register, and increment
            Case 1. Fall-through                        3        -        24
            Case 2. Jump                                7        -        24

TIOH        Read or write 2 words into 48-bit
            host-readable register, interrupt host      2        -        24

CGT, CGE,   Test register against register
CLS, CLE,   Case 1. Fall-through                        3        -        24
CEQ, CNE    Case 2. Jump                                8        -        24

CCALL       Enter subroutine, ignore processors         20       -        24

CCALLS      Enter subroutine, synch                     23       -        24

CRET        Return from subroutine, ignore processors   30       -        24

CRETS       Return from subroutine, synch               33       -        24

UBDCST      Unconditionally force the processor to
            accept a stream of N words for PEM or
            PEPM with starting address in CU
            register                                    6+4N     Fetch during inst.  48

UBDCSTE     Same except only enabled processors are
            loaded                                      6+4N     Fetches  48

TABLE 2-11 (cont.) 

                                                        No. CU
Instr.      Description                                 Clocks   Memory   Instr. Length

USETP       Unconditionally force the content of CUM
            into designated processor register. CUM
            address is in instruction stream with
            index option                                4        Fetch    48

USETPO      Same, plus turn on "enable" bit of the
            processor                                   4        Fetch

CHALTP      Halt PE's at end of next PE instruction.
            Wait for all PE's to finish. Can restart.   4        -        24

CSTOPP      Stop processors in second clock of this
            instruction. Cannot restart processors
            until reinitialized.                        3        -        24

LOADCU      Fetch to CUM from EM via TN. EM address
            in CU register is DIV 521 to make
            address-within-module, and MOD 521 to
            form module no. (which sets the barrel
            part of the TN). The DIV and MOD are
            computationally expensive; therefore,       26+4N
            we stream N words. (Note 1)                 (Note 1) Series of Stores   48

STORCU      Store from CUM to EM. Address calculation   26+4N
            like LOADCU. N words (Note 1)               (Note 1) Series of Fetches  48

LOADRCU     Same as LOADCU except the destination
            is the register, rather than memory
            pointed to by the register                  23       -        48

STORRCU     Same as STORCU except the data is taken
            from the reg. rather than memory            23       -        48

CFETCH      Fetch from CUM to address indexable by
            register                                    1        Fetch    48

CSTORE      Store to CUM from register                  1        Store    48

CJUMP       Change PCR setting                          1        -        24

LOADEM      Set TN to settings found in register (ROM
            for log3 (skip-distance) is in hardware).
            Send "CU ready" bit to processor. When "all
            processors ready" comes back, send N
            successive "read" commands to EM at 4 clock
            spacing. (See Note 1) Includes TN setting
            for broadcasting to all processors for one
            EM module.                                  4+4N     -        24

STOREM      Same, except "write" command sent to EM.    8        -        24

TABLE 2-11 (cont.) 

                                                        No. CU
Instr.      Description                                 Clocks   Memory   Instr. Length

SHIFTN      Set TN setting and send "CU ready".
            When "all processors ready" comes back,
            wait 1 clock, set TN to 2d setting,
            and send "go".                              8        -        24

EMNO        Set TN setting and send "CU ready". When
            "all processors ready" comes back, send
            "read module no." to EM and "go" to
            processor, appropriately timed.             6        -        24

CGTS, CGES, Perform indicated test and wait for "all
CLSS, CLES, processors ready". Then send command to
CEQS, CNES  processors to load PCR to either first or
            second address depending on the test
            result. Also branch in CU if test
            succeeds.                                   6        -        24

CTIXS       Test index against limit and wait for "all
            processors ready". Then jam

CILIT       16-bit literal to int. reg.                 1        -        24

CLITT       Transfer 32-bit literal to CU reg.          2        -        48

CALIT       Add 32-bit literal to CU reg.               2        -        48

SETTN       Set TN controls. No synchronization or
            processor interaction occurs.               4        -        24

LOOP        Wait till "all processors ready". If
            any are enabled, issue "go". If none
            are enabled, jam processor PCR to new
            setting found in address field. Used for
            synchronized execution of loops whose loop
            control is in a processor variable, and may
            be data dependent per processor.            2        -        24

SYNCH       Wait for "all processors ready". Issue
            "go".                                       2        -        24

TABLE 2-11 (cont.) 

                                                        No. CU
Instr.      Description                                 Clocks   Memory   Instr. Length

BDCST       Wait for "all processors ready", then
            transmit byte-serial word and "go".
            Case 1. Word comes from CU register         5        -        24
            Case 2. Word comes from CUM                 5        Fetch    48

HVST        Wait for "all processors ready", then
            transmit "go", receive 48-bit word. (If PE
            is transmitting an integer, later bytes
            may be empty except for the check bits.)    9        -        24

CAND, COR   Logic combination of two CU words, result
            to register.
            Case 1. Both operands in registers or lit.  2        -        24
            Case 2. One operand from CUM                2        Fetch    48

CNOT        Bit complement of CU register               2        -        24

CIMP        A and not B. Logic.
            Case 1. Both operands register or literal   2        -        24
            Case 2. One operand from CUM                2        Fetch    48

MOVE        Register-to-register move                   1        -        24

CBIT, CBITS Jump if any bit of register, ANDed with 2nd
            register or literal, is ONE                 6        -        24

Note 1: These EM instructions, with a streaming of N words per instruction, are 
included to assist in evaluating the tradeoff between allowing such an N-word 
streaming of data, and restricting the EM instructions to 1 word each. A number 
of advantages accrue to the limitation to N=1. All of these instructions are 
implemented, but in the baseline design here presented we have limited the 
machine to N=1. A design option exists to implement other N up to some large 
limit. See Chapter Six. 
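
Several of the control unit timings in Table 2-11 are formulas rather than constants. The Python sketch below evaluates three of them. It is illustrative only and rests on assumptions that the table does not settle: bit positions are counted from 1 at the least significant bit (so N is Python's `int.bit_length()`), the half clock in 3+½N is kept as an exact fraction rather than rounded, and the function names are hypothetical.

```python
from fractions import Fraction

EM_MODULES = 521  # divisor used by the DIV 521 / MOD 521 address split


def cmul_clocks(multiplier):
    """CMUL time, 3 + N/2, where N is the position of the most significant
    ONE in the multiplier (assumed to be counted from 1 at the low end)."""
    if multiplier <= 0:
        raise ValueError("sketch covers positive multipliers only")
    n = multiplier.bit_length()
    return 3 + Fraction(n, 2)


def em_address_split(em_address):
    """Split an EM address as LOADCU does: DIV 521 gives the address within
    the module, MOD 521 gives the module number."""
    within_module, module = divmod(em_address, EM_MODULES)
    return module, within_module


def loadcu_clocks(n_words):
    """LOADCU / STORCU time for a stream of n_words words: 26 + 4N."""
    return 26 + 4 * n_words


print(cmul_clocks(3), cmul_clocks(1 << 15))
print(em_address_split(1000))
print(loadcu_clocks(1))
```

The first two results show why multiplying by small positive integers is fast: under these assumptions a multiplier of 3 costs 4 clocks, while a 16-bit multiplier costs 11.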
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TABLE 2-12 

FLOATING POINT SCALAR INSTRUCTIONS 



Instr.        Description                                      Clocks  Memory  Instr. Length

ADD, SUB      Case 1. Reg. or lit. + reg. to reg.              6               24 (48 if lit.)
              Case 2. Reg. + mem. to reg.                      6       Fetch   48

MUL           Case 1. Reg. x reg. or lit. to reg.              9               24 (48 if lit.)
              Case 2. Reg. x mem. to reg.                      9       Fetch   48

DIV           Case 1. Reg./reg. or lit. to reg.                44              24 (48 if lit.)
              Case 2. Reg./mem. to reg.                        44      Fetch   48

DIVR          Same as DIV with operands reversed.
              Case 2 only.                                     44      Fetch   48

MAD           Case 1. Reg. x reg. or lit. + reg. to reg.       11              24 (48 if lit.)

SSQ           Case 1. Reg.**2 + reg.**2 to reg.                21              24
              Case 2. Mem.**2 + reg.**2 to reg.                21      Fetch   48

ADDD          Floating point double length addition            13              24

MULD          Floating point double length multiply
              capability (single length inputs)                17              24

LT, LE, GT,   Tests on floating point registers                2 if fall thru  48
GE, EQ, NE,                                                    4 if jump
INFY, INFL

FIX, FLOAT    Convert data type                                4               24

INFX          Convert infinitesimal to zero                    1               24

SETFL, SETZ   Set response to underflow to infinitesimal
              or zero                                          1               24

PAK2          Pack two truncated fl. pt. words in mem. word    6       Store   48

UPF           Unpack two truncated fl. pt. words               2       Fetch   48

PENO          Load CU register with predetermined lit.
              Supplied only to permit symmetry with
              processors' code stream.                         1               24

ABS           Take absolute value.
              Case 1. |reg.|                                   1               24
              Case 2. |mem.|                                   1       Fetch   48

NEG           Change sign                                      1               24




TABLE 2-13 

OFFSET TIMES OF PROCESSOR-CU SYNCHRONIZED INSTRUCTIONS

CU INSTRUCTION OR ACTION      PROCESSOR INSTRUCTION      T1

Interrupt                     HELP                       -3
LOADEM                        LOADEM                      1
STOREM                        STOREM                      1
SHIFTN                        SHIFTN                      3
EMNO                          EMNO                        1
BDCAST                        BDCAST                     -3
HVST                          HVST                       -3
SYNC                          WAIT                       -3
CGTS, CGES, CLSS,             WAIT                       -3
CLES, CEQS, CNES,
CTIXS, CJUMPS,
CBITS
CCALLS                        STOP or WAIT               -3
CRETS                         STOP or WAIT               -3
LOOP                          WAIT                       -3
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Ref. 1. Burroughs Corporation, "Final Report, Numerical Aero- 
dynamic Simulation Facility, Preliminary Study", Dec. 1977. 
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CHAPTER 3 
SOFTWARE ISSUES 


3.1 EXTENDED FORTRAN FOR THE FMP 

3.1.1 INTRODUCTION 

This chapter describes the extensions and restrictions on the FMP 
FORTRAN language and compiler at the functional level. The 
overall functional view of this piece of software is stated below, 
and is sketched in Figure 3-1. 

1. NSS FORTRAN will be as compatible with ANSI FORTRAN 
(X3J3/90) and B7800 FORTRAN as the architecture permits. 
Differences from these standards will be indicated in this 
document and in detail in the later detailed design 
specification. 

2. The compilation process will be performed on the B7800
front end and will produce code to be executed on the FMP
system.

3. FMP FORTRAN will have array operations designed to allow 
the explicit expression of parallel operations available with 
the architecture. 

4. The compiler will be designed in a modular fashion, with
an identical internal representation between components, so
that additional modules can be added if desired. The
components as envisioned at this time are:

a. A parser

b. A preliminary optimizer which performs standard serial
optimization techniques.

c. A secondary optimizer which may reorder code to obtain
maximum overlap of functional units.

d. A code generator

e. A source regenerator which will regenerate serial
FORTRAN as a method of enhancing portability and
providing the user with a programming tool during the
early phases of using the machine.

Figure 3-1. FMP Compiler Components

3.1.2 Functional Objectives of Language Development 

In the development of the FMP language and the FMP compiler the
following goals were set:

1. Allow the user to access features of the machine in a
simple, straightforward manner.

2. Add a small number of extensions which are general in
nature rather than a host of specific cases.

3. As much as is possible, keep both the syntax and semantics
of the extensions isolated from those employed in serial
FORTRAN constructs.

4. Provide easily understood and recognizable constructs
which yield programs that the user can understand and
recognize without translation back to serial constructs.

3.1.3 Major Extensions to FORTRAN 

There are only two primary extensions to ANSI FORTRAN. All
other additions and restrictions to the language follow from these
primary extensions. The two consist of a modification to the
normal set of non-executable specification statements and the
addition of a parallel construct.

The modifications in the specification statements are made to
allow the user to control the memory allocation to maximize
efficient utilization of the machine. These memory residence
specifications allow the user to explicitly control the allocation
of his data among the Control Unit Memory (CUM), the Extended
Memory (EM), and Processor Memory (PM). The second extension is a
parallel construct put in the language to give the user a simple
way in which to express the parallel aspects of his problem. With
both constructs equivalences can be made to ANSI FORTRAN so that
a serial FORTRAN can be regenerated.

3.1.4 Specification Statements 

The modifications to FORTRAN will permit the following
specifications:

1. DIMENSION

2. EXTENDED

3. LOCAL

4. GLOBAL

For the present the following statements will be disallowed:

1. EQUIVALENCE

2. COMMON (Blank or named)

3.1.4.1 The DIMENSION statement retains its ANSI FORTRAN meaning.
The DIMENSION statement is used to specify the symbolic names and
dimension specifications (extents) of arrays.

3.1.4.2 The EXTENDED specification statement declares that the
variables specified in the statement are resident in the Extended
Memory. The form of the declaration is:

EXTENDED /cb/ nlist (, /cb/ nlist) ...

or

EXTENDED nlist

where cb is an extended block name and

nlist is a list of variable names or array declarators. Only one
appearance of a symbolic name as a variable name or array
declarator is permitted in all such lists in a program unit.
The ellipses represent repetition.

This construct is similar to blank COMMON in the sense that execu-
tion of a RETURN or END statement never causes these quantities to
become undefined. (See Specification FORTRAN X3J3/90 page 8-3)

3.1.4.3 The LOCAL specification statement declares that the
variables specified in the statement are resident in Processor
Memory. The form of the declaration is:

LOCAL /cb/ nlist (, /cb/ nlist) ...

or

LOCAL nlist

where cb and nlist are defined as above.

This construct is similar to named COMMON in FORTRAN in the sense
that execution of a RETURN or END may cause the quantities to be
undefined. Note however that execution of a RETURN or END within
a subprogram will not cause entries to become undefined in a
LOCAL block that appears in the subprogram and appears in at least
one other program unit that is referencing it either directly or
indirectly. (See Specification FORTRAN X3J3/90 page 15-15)

3.1.4.4 The GLOBAL specification statement declares that vari-
ables specified in the statement are controlled by the Control
Unit and are broadcast automatically to the Processor Memory on
program initiation or if they are modified during the execution
of a program. The form of the declaration is:

GLOBAL /cb/ nlist (, /cb/ nlist) ...

or

GLOBAL nlist

where cb and nlist are defined as above.

3.1.5 The Parallel Construct 

The executable DOALL construct is a control statement provided to
permit concurrent execution of segments of a program.

The DOALL statement is used in conjunction with a terminal
statement ENDDO to form together a loop called the DOALL loop.

The form of these two statements is

DOALL, I=I1, I2 (, I3) (; J=J1, J2 (, J3)) (; K=K1, K2 (, K3))

ENDDO

I is the name of an integer variable. I1, I2, I3 are each
integers.

3.1.5.1 Range of a DOALL loop. The range of a DOALL loop
consists of all executable statements that appear following the
DOALL statement, up to and including the terminal ENDDO statement.

No additional DOALL statements may occur within the range of a
DOALL.

If a DO statement appears within the range of a DOALL statement it
must be fully contained within the range of the DOALL statement.

If an arithmetic or logical IF statement occurs within a DOALL
statement, it may not transfer control out of the range of the
DOALL statement. Transfer into the range of a DOALL is
prohibited.

3.1.5.2 Active and inactive DOALL-loops. A DOALL loop is either
active or inactive. Initially inactive, a DOALL becomes active
only when its DOALL statement is executed.

Once active, the DOALL-loop becomes inactive only when the
iteration count (3.1.5.4) for each of its increment parameters
becomes zero.

Execution of a FUNCTION reference or a CALL statement that appears
in the range of a DOALL statement does not cause the DOALL to
become inactive. Note that specification of an alternate return
specifier outside the range of the DOALL is disallowed.

3.1.5.3 Incrementation Parameters. Specified in the DOALL
statement is at least one set of parameters which control the
execution of the statements within the range of the DOALL
loop. These are called incrementation parameter sets and there
may be a total of three of them. Each parameter set consists
of three (four) items known as the DOALL variable, the initial
parameter, the terminal parameter, and (the increment parameter).

3.1.5.4 Referencing the DOALL variable within the DOALL loop.
References to the DOALL variable I, (J) or (K) within the
DOALL-loop are permitted in the following cases:

1. Any reference to array subscripts for arrays declared to
be in Extended Memory; however, the DOALL variable may
not reference outside the declared array.

2. Any reference to the value of the DOALL variable within
an expression of an IF statement if control is not trans-
ferred beyond the range.

3. The DOALL variable may be used in the evaluation of an
assignment statement; however, not to form a forbidden
array reference.

The utilization of the DOALL variable is specifically prohibited
for the following:

1. Any reference to array subscripts for variables declared
to be LOCAL or which appear in a DIMENSION statement
either explicitly or implicitly.

2. The DOALL variable may not be reassigned within the range
of the DOALL-loop except by the DOALL statement.

3. Transfer of control into the range of a DOALL-loop is
prohibited.

3.1.5.5 Execution of the DOALL construct. The effect of execut-
ing a DOALL-loop construct is to execute all body statements,
those following the DOALL statement and preceding the ENDDO
statement, in a serial fashion for each combination of values
determined by the incrementation parameter sets in the DOALL
statement. The initial parameter M1, the terminal parameter M2,
and the incrementation parameter M3 are determined for each
incrementation set (I1, I2, I3 in the case of I). This deter-
mines the number of allowable values of the DOALL variables I
(J and K), equal to NI (NJ and NK).

The DOALL variable I with its NI allowed values is paired with the
first allowed value of J. Next the DOALL variable I with its NI
allowed values is paired with the second allowed value of J. This
continues until all possible combinations occur. The total number
of combinations is:

NI for a single DOALL-loop incrementation set

NI * NJ for a double DOALL-loop incrementation set

NI * NJ * NK for a triple DOALL-loop incrementation set

Hence the body statements are executed in serial fashion for each
given set of DOALL variables allowed (I; I and J; or I, J, and K),
while the sets themselves are treated in a strictly parallel sense.
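
The counting rule of 3.1.5.5 can be illustrated with a minimal sketch, written in Python rather than FMP FORTRAN. It assumes that the number of allowed values of each DOALL variable follows the ANSI FORTRAN DO-loop iteration count rule, which this section does not state explicitly. The names allowed_values and doall_index_sets are hypothetical and are not part of the FMP language.

```python
from itertools import product


def allowed_values(m1, m2, m3=1):
    """Values of one DOALL variable for initial m1, terminal m2, increment m3."""
    if m3 == 0:
        raise ValueError("increment parameter must be nonzero")
    # Assumption: same count rule as the ANSI FORTRAN DO loop,
    # MAX(INT((M2 - M1 + M3) / M3), 0).
    count = max((m2 - m1 + m3) // m3, 0)
    return [m1 + n * m3 for n in range(count)]


def doall_index_sets(*parameter_sets):
    """All (I), (I, J) or (I, J, K) combinations, with I varying fastest."""
    if not 1 <= len(parameter_sets) <= 3:
        raise ValueError("a DOALL has one, two or three parameter sets")
    per_variable = [allowed_values(*p) for p in parameter_sets]
    # itertools.product varies its last argument fastest, so the lists are
    # reversed going in and each tuple is reversed coming out.
    return [tuple(reversed(c)) for c in product(*reversed(per_variable))]


if __name__ == "__main__":
    sets = doall_index_sets((25, 50, 2), (1, 99), (2, 100))
    print(len(sets))  # NI * NJ * NK
```

Each tuple returned corresponds to one set of DOALL variable values for which the body statements are executed serially; the order of the list only mirrors the pairing order described above and implies nothing about execution order on the machine.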

3.1.6 Subroutines & Procedures as Program Subunits (to be resolved
in Phase II)

3.1.7 Other Constructs

3.1.7.1 ASSIGN Statement. The ASSIGN statement has been dropped
as a possible candidate for an FMP extension. It was found that
the access to Extended Memory could be handled by simple compiler
algorithms through the EXTENDED declaration. It was also found
that in complex control structures the programmer was more likely
to make mistakes and cause array bound errors than if the compiler
were to perform all the necessary accessing. Some details of this
will be shown in later examples. (See 3.2.2.2 discussion and
Fig. 3-4).

3.1.7.2 I/O. All I/O for NSS FORTRAN must be performed on vari-
ables assigned to Extended or Control Unit Memory. If variables
in Processor Memory are referenced in an I/O statement a
syntactical error will result.



3.1.8 Examples of Constructs in FMP FORTRAN

3.1.8.1 Triply Nested DOALL-Loop

1. VALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DOALL, I = 2, 99; J = 2, 99; K = 2, 99
      RR = 1.0/Q(I, J+1, K-1)
      R1 = Q(I+1, J, K) - Q(I-1, J, K)
      R2 = Q(I, J, K+1) - Q(I, J, K-1)
      S(I, J, K) = RR * R1 * R2
      ENDDO;

2. INVALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DIMENSION R1(100), R2(100)
      DOALL, I = 2, 99; J = 2, 99; K = 2, 99
      RR = 1.0/Q(I, J+1, K-1)
      R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1) - Q(I, J, K-1)
      S(I, J, K) = RR * R1(I-1) * R2(I+1)
      ENDDO;

This construct is invalid because the arrays R1 and R2 declared in
the DIMENSION statement are referenced by the DOALL variable I.

If it is necessary to so reference the arrays R1 and R2, the
doubly nested DOALL construct should be used (see 3.1.8.2).

3. VALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DOALL, I = 25, 50, 2; J = 1, 99; K = 2, 100
      RR = 1.0/Q(I, J+1, K-1)
      IF (I.GT.30) GO TO 1
      R1 = Q(I+1, J, K) - Q(I-1, J, K)
      S(I, J, K) = RR * R1
      GO TO 2
    1 R1 = Q(I-1, J, K) - Q(I+1, J, K)
      S(I, J, K) = RR * R1
    2 CONTINUE
      ENDDO;

4. INVALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DOALL, I = 25, 50, 2; J = 1, 99; K = 2, 100
      RR = 1.0/Q(I, J+1, K-1)
      IF (I.GT.30) GO TO 1
      R1 = Q(I+1, J, K) - Q(I-1, J, K)
      S(I, J, K) = RR * R1
      ENDDO;
    1 CONTINUE

This DOALL-loop construct is invalid because it transfers control
out of the range of the DOALL.

5. INVALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DIMENSION R1(100), R2(100)
      GLOBAL J1, K1
      DOALL, J=2, J1; K=2, K1
      R1(1) = 6.7
      IF (J.GT.30) GO TO 3
      IF (K.LT.30) GO TO 4
      DO 1 I = 2, 99
      RR = 1.0/Q(I, J, K)
      GO TO 5
    3 RR = 1.0/Q(I, J-1, K)
      GO TO 5
    4 RR = 1.0/Q(I, J, K-1)
    5 R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1) - Q(I, J, K-1)
      S(I, J, K) = RR * R1(I-1) * R2(I+1)
    1 CONTINUE
      ENDDO;

ANSI FORTRAN specifically prohibits transfer of control from
outside a DO-loop into the body statements of a DO-loop.

3.1.8.2 Doubly Nested Loops

1. VALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DIMENSION R1(100), R2(100)
      DOALL, J=2, 99; K=2, 99
      R1(1)=6.7
      DO 1 I=2,99
      RR=1.0/Q(I, J+1, K-1)
      R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1)
      S(I, J, K) = RR * R1(I-1) * R2(I+1)
    1 CONTINUE
      ENDDO;

This is the correct syntax for handling the problem in Example 2
(3.1.8.1).

2. VALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      DIMENSION R1(100), R2(100)
      GLOBAL J1, K1
      DOALL, J=2, J1; K=2, K1
      R1(1)=6.7
      DO 1 I = 2, 99
      IF (J.GT.30) GO TO 3
      IF (K.LT.30) GO TO 4
      RR=1.0/Q(I, J, K)
      GO TO 5
    3 RR=1.0/Q(I, J-1, K)
      GO TO 5
    4 RR=1.0/Q(I, J, K-1)
    5 R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1) - Q(I, J, K-1)
      S(I, J, K) = RR * R1(I-1) * R2(I+1)
    1 CONTINUE
      ENDDO;

3.1.8.3 Use of the LOCAL Construct

1. VALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      LOCAL R1(100), R2(100), CONST
      GLOBAL JL, KL
      DOALL, J=1,JL; K=1,KL
      R1(1)=6.0
      R2(100)=10.0
      DO 1 I = 2, 99
      RR=1.0/Q(I, J, K)
      R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1) - Q(I, J, K-1)
      CALL TEST(I)
      S(I, J, K) = RR * R1(I-1) * R2(I+1) * CONST
    1 CONTINUE
      ENDDO;

      SUBROUTINE TEST(I)
      LOCAL R1(100), R2(100), CONST
      IF (R1(I).GT.R2(I)) CONST=R1(I)
      RETURN
      END

2. INVALID

      EXTENDED Q(100, 100, 100), S(100, 100, 100)
      LOCAL R1(100), R2(100)
      GLOBAL JL, KL
      DOALL, J=1,JL; K=1,KL
      R1(1) = 6.0
      R2(100) = 10.0
      DO 1 I = 2, 99
      RR=1.0/Q(I, J, K)
      R1(I) = Q(I+1, J, K) - Q(I-1, J, K)
      R2(I) = Q(I, J, K+1) - Q(I, J, K-1)
      CALL TEST(I)
      S(I,J,K) = RR*R1(I-1)*R2(I+1)*CONST
    1 CONTINUE
      ENDDO;

Using the identical SUBROUTINE TEST above would cause an undefined
reference to CONST because the LOCAL declaration does not contain
the variable CONST. Naturally, TEST could have been defined with
two parameters, I and CONST, which would have been valid.
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3.2 HAND COMPILATION FOR SAM 
3.2.1 Overview 

The methodology of hand compilation for the SAM will be described
through a series of examples, each of which will be transformed in
a series of stages from original FORTRAN to ASSEMBLER CODE.
References will be made to Appendix A, which discusses
preliminary compiler algorithms for setting the transposition
network.

In each example the following code steps will be taken:

1. Original NASA-AMES FORTRAN

2. Extended FORTRAN for SAM

3. Compiler output including code reorganization (written in
a pseudo FORTRAN)

4. Compiler output showing Transposition Network and Memory
Module computations (again in a pseudo FORTRAN or META
ASSEMBLER)

5. ASSEMBLER CODE

The example chosen from the Explicit Code was the SUBROUTINE
TURBDA because it demonstrates the ability of SAM to operate in a
concurrent manner, provides a vehicle for demonstrating the com-
piler's ability to handle control statements through a "mimicking"
technique, and also provides an example of why it is felt that an
ASSIGN statement could cause programmer error. The second example
is the major LOOPS of the SUBROUTINE STEP including the subroutine
calls and the called SUBROUTINES BTRI and XXM. One loop (DO 20)
will be discussed in detail while the other two (DO 30) and (DO
40) will show the differences in the transposition network
settings and the memory module accesses for the different memory
accessing. (DO 30 and DO 40 discussion to be supplied later.)





3.2.2 SUBROUTINE TURBDA 

3.2.2.1 Original Code and SAM Extended FORTRAN

In Figure 3-2 the original NASA-AMES version of the SUBROUTINE is
shown. The FMP Extended FORTRAN as written by the programmer is
given in Figure 3-3. In both cases the common declarations were
modified slightly to remove extraneous variables from this
specific example. As you will note, the programmer wrote a two
dimensional DOALL-loop with a serial inner DO loop. Because there
is no data dependency on I, it could have been written as a three
dimensional DOALL.

3.2.2.2 Preliminary Code Analysis

Figure 3-4 shows the preliminary compiler code analysis. Within
the DO 1 loop the compiler determines what array elements stored
in Extended Memory must be fetched through the Transposition
Network. For a given I, J, K, EI(I, J, K) must be fetched.
However, only for J=1 must the element EI(I, J+1, K) be fetched,
and only for K=1 must the element EI(I, J, K+1) be fetched. The
compiler will be capable of recognizing these accesses to Extended
Memory and will "mimic" the branch structure. With this mirroring
of the original structure it will also be able to access only the
requisite elements and prohibit out of bounds access of the array,
even if those elements are not subsequently used. This protection
is even more critically necessary when accesses occur in the
negative sense rather than the positive one as in this example.

      SUBROUTINE TURBDA
      COMMON/A12/ RHOW(31,31,31),E(31,31,31),EI(31,31,31)
      COMMON/A5/ IL,JL,KL,CV
      COMMON/A6/ RMUL(31,31,31)
      CV1=1./CV
      DO 1 K=1,KL
      DO 1 J=1,JL
      DO 1 I=1,IL
      TEMP=ABS(EI(I,J,K))*CV1
      IF(K.EQ.1) TEMP=.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      IF(J.EQ.1) TEMP=.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      RMUL(I,J,K)=2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
      RETURN
      END

Figure 3-2. Original NASA-AMES FORTRAN


      SUBROUTINE TURBDA
      EXTENDED/A12/ RHOW(31,31,31),E(31,31,31),EI(31,31,31)
      GLOBAL/A5/ IL,JL,KL,CV
      EXTENDED/A6/ RMUL(31,31,31)
      CV1=1./CV
      DOALL, J=1,JL;K=1,KL
      DO 1 I=1,IL
      TEMP=ABS(EI(I,J,K))*CV1
      IF(K.EQ.1) TEMP=.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      IF(J.EQ.1) TEMP=.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      RMUL(I,J,K)=2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
      ENDDO;
      RETURN
      END

Figure 3-3. Extended FORTRAN for SAM





      SUBROUTINE TURBDA
      EXTENDED EI(31,31,31), RMUL(31,31,31)
      GLOBAL CV,JL,KL,IL
      DOALL, J=1,JL;K=1,KL
      CV1 = 1.0/CV
      DO 1 I=1,IL
      E1 = EI(I,J,K)
      FOR(J.NEQ.1) null fetch next line
      E2 = EI(I,J+1,K)
      FOR(K.NEQ.1) null fetch next line
      E3 = EI(I,J,K+1)
      IF(J.EQ.1) GO TO 3
      IF(K.EQ.1) GO TO 2
      TEMP=ABS(E1)*CV1
      GO TO 4
    2 TEMP=0.5*ABS(E1+E3)*CV1
      GO TO 4
    3 TEMP=0.5*ABS(E1+E2)*CV1
    4 RMUL(I,J,K) = 2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
      ENDDO
      RETURN
      END

Note: The expression "null fetch next line" implies that the
transposition network will be set to fetch all the elements for
EI(I,J+1,K) for given I. However, only those for which J=1 will in
fact be passed from Extended Memory to the Processors.

Figure 3-4. Compiler Code Analysis

As one can see in this example, all processors for which J≠1 and
K≠1 execute TEMP=ABS(E1)*CV1. All processors for which J=1
(including K=1) compute TEMP=0.5*ABS(E1+E2)*CV1. All processors
for which K=1 and J≠1 form TEMP=0.5*ABS(E1+E3)*CV1. These three
cases occur for a given I concurrently.
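
The "mimicked" branch structure can be sketched as follows. This is an illustration in Python, not compiler output; the function names required_fetches and case_for are hypothetical. It shows which EI elements a processor holding the point (I, J, K) actually needs, and which of the three TEMP expressions that processor evaluates.

```python
EXTENT = 31  # declared extent of each dimension of EI in Figure 3-4


def required_fetches(i, j, k):
    """Subscripts of EI that must really be fetched for the point (i, j, k)."""
    fetches = [(i, j, k)]              # E1 is always needed
    if j == 1:
        fetches.append((i, j + 1, k))  # E2 is used only when J = 1
    if k == 1:
        fetches.append((i, j, k + 1))  # E3 is fetched only when K = 1
    return fetches


def case_for(j, k):
    """Which TEMP expression a processor evaluates, in the branch order of Figure 3-4."""
    if j == 1:
        return "0.5*ABS(E1+E2)*CV1"
    if k == 1:
        return "0.5*ABS(E1+E3)*CV1"
    return "ABS(E1)*CV1"


def in_bounds(subscript):
    return all(1 <= s <= EXTENT for s in subscript)


if __name__ == "__main__":
    counts = {}
    for k in range(1, EXTENT + 1):
        for j in range(1, EXTENT + 1):
            counts[case_for(j, k)] = counts.get(case_for(j, k), 0) + 1
    print(counts)
```

Because E2 and E3 are requested only where J=1 or K=1, no subscript ever exceeds the declared extent, even at J=31 or K=31, where an unconditional fetch of EI(I,J+1,K) or EI(I,J,K+1) would fall outside the array.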

3.2.2.3 Compiler Programmatic Transformations Including
Transposition Network Calculations

Figure 3-5 shows the Control Unit and Processor Element code
streams in a FORTRAN-like language or META ASSEMBLER. The
compiler recognizes the two dimensional DOALL on J,K, which are
the second and third indices of the Extended arrays EI and RMUL,
and calculates the number of cycles to be performed (the DO 10
loop),

i.e.  NMAX = (ISECONDSIZE*ITHIRDSIZE + Nprocessors - 1) / Nprocessors

           = (31*31 + 512 - 1) / 512

           = 2   (integer division)

Similarly the compiler recognizes that ISKIP=IFIRSTSIZE=31. Note
that all accesses to EI and RMUL are of type 1 as described in
Appendix A.

Line       CU INSTRUCTIONS                  PE INSTRUCTIONS

  1        ENTER TURBDA                     ENTER TURBDA
  2                                         CV1=1.0/CV
  3        DO 10 N=1,2                      DO 10 N=1,2
  4        IW=512*N-512                     IW=512*N-512
  5                                         IV=IW+PENO
  6                                         KM1=IV/31
  7                                         K=KM1+1
  8                                         J=IV-KM1*31+1
  9        IN=IW*31                         IN=IV*31
 10        IA01=IBSEI+IN                    IA01=IBSEI+IN
 11        IA02=IA01+31                     IA02=IA01+31
 12        IA03=IA01+961                    IA03=IA01+961
 13        IA04=IBSRM+IN                    IA04=IBSRM+IN
 14        DO 1 I=1,IL                      DO 1 I=1,IL
 15        II=I-1                           II=I-1
 16        OFFSET1=MOD(IA01+II,521)         MADD1=(IA01+II)/521
 17        SYNCH                            SYNCH
 18                                         FOR (J.NE.1) MODE=0
 19        OFFSET2=MOD(IA02+II,521)         MADD2=(IA02+II)/521
 20        SYNCH                            SYNCH
 21                                         FOR (K.NE.1) MODE=0
 22        OFFSET3=MOD(IA03+II,521)         MADD3=(IA03+II)/521
 23        SYNCH                            SYNCH
 24                                         IF (J.GT.JL) GO TO 8
 25                                         IF (K.GT.KL) GO TO 8
 26                                         IF (J.EQ.1) GO TO 3
 27                                         IF (K.EQ.1) GO TO 2
 28                                         TEMP=ABS(E1)*CV1
 29                                         GO TO 4
 30                                      2  TEMP=0.5*ABS(E1+E3)*CV1
 31                                         GO TO 4
 32                                      3  TEMP=0.5*ABS(E1+E2)*CV1
 33                                      4  R=2.270E-08*TEMP
 34                                         *SQRT(TEMP)/(TEMP+198.6)
 35        OFFSET4=MOD(IA04+II,521)         MADD4=(IA04+II)/521
 36                                      8  CONTINUE
 37        SYNCH                            SYNCH
 38     1  CONTINUE                      1  CONTINUE
 39    10  CONTINUE                     10  CONTINUE
 40        EXIT                             EXIT

Note: The expression MODE=0 is merely a device used to imply that for those
values of the variable not equal to 1, fetches through the Transposition
Network do not occur.

Figure 3-5. Compiler Output with Transposition Calculations
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On entering the subroutine (line 1) of Figure 3.5 each processing 
element calculates CV1 (line 2). Loop 10 is then initiated, which
represents the number of times the array must be cycled as
mentioned above (line 3). Next IW is calculated, which repre-
sents the number of processors that have been utilized up to that
cycle number. Obviously the compiler does not perform 512*N-512
but rather starts from zero and increments by 512; however, FORTRAN
usage was utilized here. The processing elements then perform a
number of calculations (line 4 - line 8). IV=IW+IPENO represents
the address in J,K space that each processing element has. From
that number its J and K values are determined (line 7 and line 8).

KM1 (line 6), which represents the K value minus 1 and is used in
the J calculation, is calculated separately.

Lines 10 thru 13 represent address calculations. For the control
unit, one is calculating the address of the array element which is
to go into processing element 0 for each transposition network
setting, i.e. the OFFSET. The processing element is performing
an address calculation on its specific array element. This is
why line 9 has different determinations for IN. Lines 10 thru 13
are address calculations for EI(I,J,K) (line 10), EI(I,J+1,K) (line
11), EI(I,J,K+1) (line 12) and RMUL(I,J,K) (line 13). Note lines
10 and 13 start from the base address IBSEI of EI and IBSRM of
RMUL. The CU instructions are computing the address calculation
for the array element which is to go to processor 0 while the
processors are calculating the address of the array element to go
to Processor = IPENO.
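
The index and address arithmetic of lines 4 thru 16 of Figure 3-5 can be checked with a short sketch. It is written in Python purely as an illustration; the function names are hypothetical, the base address is passed in as an arbitrary integer, and the value 521 is simply the modulus that appears in the figure.

```python
N_PROCESSORS = 512
EXTENT = 31      # IFIRSTSIZE = ISECONDSIZE = 31 for EI and RMUL
MODULUS = 521    # modulus used in the OFFSET and MADD lines of Figure 3-5


def pe_indices(n, peno):
    """Lines 4 thru 8: position in J,K space of processor peno on cycle n."""
    iw = N_PROCESSORS * n - N_PROCESSORS
    iv = iw + peno
    km1 = iv // EXTENT
    k = km1 + 1
    j = iv - km1 * EXTENT + 1
    return iv, j, k


def element_address(base, iv, i):
    """Lines 9, 10 and 15: address of element (I, J, K) relative to the array base."""
    return base + iv * EXTENT + (i - 1)


def split_address(address):
    """Line 16: the MOD part (formed in the CU) and the quotient (formed in the PE)."""
    return address % MODULUS, address // MODULUS


if __name__ == "__main__":
    iv, j, k = pe_indices(2, 448)
    print(iv, j, k)                                   # last in-range point of cycle 2
    print(split_address(element_address(0, iv, 1)))
```

For the control unit the same expressions are evaluated with PENO taken as zero, which is why IN is formed from IW in the CU stream and from IV in the processor stream.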
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Note all these index computations are performed only for the outer 
loop. They do not occur for the inner DO 1 I=1,IL loop (line 14).

Next the I index is decremented by 1 (line 15), again a FORTRAN
artifact, which would not occur in the ASSEMBLER code, but this is
FORTRAN. The memory module address, MADD1 (line 16), is computed
in the processing element while the offset, OFFSET1 (line 16), is
computed by the MOD function in the control unit. The array and
the control unit now SYNCHRONIZE. In a similar fashion the
offset and memory module address are calculated for each of the
next two array accesses and synchronized accordingly (lines 18 thru
23). Note that for (J.NE.1) (line 18) a mode bit is set which
turns off the array fetch. Similarly for (K.NE.1) (line 21).

The next step the compiler takes is to skip computations for those
values of J between JL+1 and 31, the value declared for the array
in the EXTENDED declaration (line 24). This is the way the
preliminary compiler is going to handle the one dimensional vector
length/declared extent problem at this juncture. Alternative
algorithms are known; however, teaching the algorithms and the
subsequent hand compilation would cost Burroughs more effort
than is warranted by the possible machine performance degradation
that might occur during simulation. For (K.GT.KL) a similar branch
is performed (line 25). Note that 8 CONTINUE must be above the next
synchronization point. Next come the branches for the sections of
code which will be computed for (J.EQ.1), for ((K.EQ.1).AND.(J.NEQ.1)),
and for all other J and K values not exceeding JL and KL (lines 26
thru 32). All processors except those that have J or K values
greater than JL or KL then process lines (33,34). The OFFSET
calculation for RMUL is then made in the Control Unit and the
Memory Module address in the processors (line 35). Synchroni-
zation occurs and the transfer of RMUL(I,J,K) from Processor to
Extended Memory occurs. Lines 14 to 37 are looped until IL is
reached and then the second cycle, lines 3 to 38, is executed
before the subroutine is EXITed.
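
As a cross-check of the walk-through above, the following sketch emulates the two-cycle, 512-processor schedule of Figure 3-5 and compares its results with the serial triple loop of Figure 3-2. It is a Python illustration under stated assumptions: the processors are simply visited one after another, Extended Memory is modelled as a dictionary keyed by (I, J, K), and the function names are hypothetical.

```python
import math

N_PROCESSORS = 512
EXTENT = 31


def rmul_value(ei, i, j, k, cv1):
    """Body of the DO 1 loop for one point, with the J=1 test taking precedence."""
    if j == 1:
        temp = 0.5 * abs(ei[i, j, k] + ei[i, j + 1, k]) * cv1
    elif k == 1:
        temp = 0.5 * abs(ei[i, j, k] + ei[i, j, k + 1]) * cv1
    else:
        temp = abs(ei[i, j, k]) * cv1
    return 2.270e-08 * temp * math.sqrt(temp) / (temp + 198.6)


def serial(ei, il, jl, kl, cv1):
    """Loop order of the original subroutine: K outermost, I innermost."""
    return {(i, j, k): rmul_value(ei, i, j, k, cv1)
            for k in range(1, kl + 1)
            for j in range(1, jl + 1)
            for i in range(1, il + 1)}


def cycled(ei, il, jl, kl, cv1):
    """Schedule of Figure 3-5: two cycles over 512 processors, inner loop on I."""
    out = {}
    for n in (1, 2):                        # DO 10 N=1,2
        for peno in range(N_PROCESSORS):    # concurrent on the real machine
            iv = N_PROCESSORS * n - N_PROCESSORS + peno
            km1 = iv // EXTENT
            k, j = km1 + 1, iv - km1 * EXTENT + 1
            if j > jl or k > kl:            # lines 24 and 25: skip to 8 CONTINUE
                continue
            for i in range(1, il + 1):      # DO 1 I=1,IL
                out[i, j, k] = rmul_value(ei, i, j, k, cv1)
    return out


if __name__ == "__main__":
    r = range(1, EXTENT + 1)
    ei = {(i, j, k): 400.0 + i + 2 * j + 3 * k for i in r for j in r for k in r}
    print(serial(ei, 5, 31, 31, 0.5) == cycled(ei, 5, 31, 31, 0.5))
```

The comparison holds for any JL and KL up to the declared extent, since processors whose J or K exceeds the limits take the branch to 8 CONTINUE and contribute nothing.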
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Earlier it was mentioned that this piece of code could have been 
executed as a three dimensional DOALL loop. As can now be seen,
this would probably not be advantageous in terms of performance
for two reasons. First, due to the branches on J and K (lines 24
thru 27), each processor would have to perform the index cal-
culations of lines 6, 7, and 8 for all I values if one did a 3-D
DOALL-loop. Second, since IL<31, one only needs to execute this
loop IL times with the preliminary compiler and a 2-D DOALL-loop.
In a 3-D DOALL-loop, I would have to be computed and a branch
similar to lines 24 and 25 would also have to be made. At this
time this appears less efficient in highly branched code and where
the array fit is good, i.e., on cycle 1 all 512 processors are
utilized while in cycle 2, 88% of the processors are utilized. If
the array size were instead EI(25,25,25) then 100% would be used
on cycle 1 while only 113, or 22%, would be used on cycle 2. With a
3-D DOALL one would have 31 cycles, of which 30 would be 100% busy
and 1 cycle about 50% busy. In that case the additional indexing com-
putations would be masked in the total execution time.
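
The cycle counts and utilization figures quoted above follow from simple integer arithmetic, sketched here in Python. The function names are hypothetical; the only inputs are the number of points in the DOALL index space and the 512 processors.

```python
def cycles_needed(n_points, n_processors=512):
    """Number of passes over the processor array (NMAX in 3.2.2.3)."""
    return (n_points + n_processors - 1) // n_processors


def utilization_per_cycle(n_points, n_processors=512):
    """Fraction of processors doing useful work on each cycle."""
    full, remainder = divmod(n_points, n_processors)
    cycles = [1.0] * full
    if remainder:
        cycles.append(remainder / n_processors)
    return cycles


if __name__ == "__main__":
    for label, points in (("2-D, 31 x 31", 31 * 31),
                          ("2-D, 25 x 25", 25 * 25),
                          ("3-D, 25 x 25 x 25", 25 ** 3)):
        util = utilization_per_cycle(points)
        print(label, cycles_needed(points), round(100 * util[-1]))
```

For the 3-D case with EI(25,25,25) the last cycle uses 265 of 512 processors, which is the roughly 50% figure given in the text.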

3.2.2.4 Assembler Code for TURBDA

This code is shown in Figures 3-6 and 3-7. 
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L 


          IDENT     CU/IMPLICIT/TURBDA
          CODESEG
          ENT       START
START     CILIT     CR1,0
          CILIT     CR2,1
L3        CTIX      CR1,CR2,L4
          CSHFN     CR3,CR1,-9
          CMULL     CR6,CR3,31
          CFETCH    CR8,IL
          CILIT     CR7,1
L14       CTIX      CR7,CR8,L1
          CIADDL    CR9,CR6,IBSEI1
          CIADDR    CR9,CR7,CR9
          MOD521    CR9
          CILIT     CR10,31
          LOADEM    CR9,CR10
          CIADDL    CR9,CR6,IBSEI2
          CIADDR    CR9,CR7,CR9
          MOD521    CR9
          LOADEMC   CR9,CR10
          CIADDL    CR9,CR6,IBSEI3
          CIADDR    CR9,CR7,CR9
          MOD521    CR9
          LOADEMC   CR9,CR10
          CIADDL    CR9,CR6,IBSRM1
          CIADDR    CR9,CR7,CR9
          MOD521    CR9
          STOREM    CR9,CR10
          JUMP      L14
L1        JUMP      L3
L4        RETURN
          END

Figure 3-6. Handcompiled Control Unit Code
Subroutine TURBDA
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L 



1000 

INDEMT 

PE ''IMPLICIT TUPBCifi 

10 fC- 

CODE9EG 


10 20 

EUT 

START 

10 30 '3TRPT 

FLIT 

FPl . I .0 

10*+0 

FDIML 

FPI .FPI .CUl 

1C50 

ILIT 

IP2. I 

1 0t'O 

ILIT 

I R 1 . 0 

1070 L3 

iTi:>; 

IP1 .IR2.LH 

10 00 

SHFL 

I R3 . 1 R2 . -9 

10 90 

PEUCi 

I PH 

1 100 

IftCiD 

IPH,1P3.IRH 

1110 

IDIUL 

IR5,IRH,31 

1120 

ISTOF'E 

IRS.THI 

1130 

I MULL 

IP6.IR5.3I 

11 HO 

I sue 

IP6)IRH.IR6 

1150 

ISTORE 

IR6»JM1 

1 1 60 

IMULL 

IP6.IP“.31 

1170 

IFETCH 

IPS.IL 

1 1 00 

ILIT 

IR7. 1 

1190 L 1 H 

ITI.X 

IP7.IP0.L1 

1200 

IfiDDM 

IP8,IR6,ieSEIl 

120 5 

lADD 

IPS)1P0.IR7 

1210 

1 0521 

IRS 

1220 

LOfiDEM 

IRS. El 

1230 

JftDDM 

IR8.IP6.IBSEI2 

1235 

I ADD 

IPS,IR0.1R7 

12H0 

ID521 

. IRS 

1250 

ILIT 

1 P 1 0 . 1 

1260 

IFETCH 

IPl 1 ..JM1 

1270 

lEO 

IP1 1 .0 .L1C0 

12S0 

ILIT 

IPIO .0 

1290 LI 00 

LOFIOEMC 

I PS . E2 , 1 R 1 0 

1300 

lADDM 

IR8.IP6.IBSE13 

130 5 

I HDD 

IP8.IP8.IR7 

1310 

1D521 

IRS 

1320 

ILIT 

IRIO . 1 - 

1330 

IFETCH 

IPM .I.M1 

I3H0 

lEO 

IR1 1 ,0 .L200 

1350 • 

ILIT 

if; 10 .0 

.1360 L20C 

LCifiOEI'lO 

IRS. E3. IRIO 

I3.’^0 

IFETCH 

IP12.JH1 

1375 

IFETCH 

IP13..JL 

1376 

isueL 

IR13.IR13,! 

1300 

IGT 

IRI2.IP13.LS0 

1 390 

IFETCH 

IP13.LM1 

1 HOO 

IFETCH 

IPIH.hL 

I HO 5 

ISUBL 

IRIH.IPI"*.! 

1H10 

I6T 

IPT3.1P1H .LSO 

IH20 

lEO 

IPl I ,0 .L30 

1H30 

lEC 

IPT3)0 ,L2G 

IHHO 

FFETCH 

FP2.EI 

1H50 

BBS 

FP2 

into 

FMUL 

FP2,FPI .FP2 

1H70 

FSTORE 

FP2.TENP 

IHgO 

■JUMP 

LHO 


Figure 3-7. Handcompiled Execution Unit Code 
Subroutine TURBDA 



3.2.3 SUBROUTINE STEP (LOOP DO 20)


The next portion of code to be examined is STEP (loop DO 20), which
includes CALLs to BTRI and XXM. A number of figures have been
made of the code; they are listed below with brief descriptions.


Figure 3-8    The original NASA-AMES FORTRAN of Subroutine STEP
Figure 3-9    SAM Extended FORTRAN for Subroutine STEP
Figure 3-10   A comparison file of Figures 3-8 and 3-9 showing
              R (Replacements), I (Insertions), - (Deletions)
Figure 3-11   Preliminary Compiler Code Reorganization for
              Subroutine STEP
Figure 3-12   A comparison of Figures 3-9 and 3-11
Figure 3-13   Compiler programmatic transformations including
              Transposition Network Settings for Control Unit -
              Subroutine STEP
Figure 3-14   Same as above for Processor - Subroutine STEP
Figure 3-15   Implicit/Steppiece NSS3CU Assembler Code
Figure 3-16   Implicit/Steppiece NSS3PE Assembler Code


Additionally, the SUBROUTINES BTRI and XXM are examined. The
related figures are:

Figure 3-17   Original NASA-AMES Code for Subroutine BTRI
Figure 3-18   SAM Extended FORTRAN for Subroutine BTRI
Figure 3-19   Comparison of Figures 3-17 and 3-18
Figure 3-20   Original NASA-AMES Code for Subroutine XXM
Figure 3-21   A modified version of XXM1 which will produce
              improved performance on the CDC7600 and SAM
Figure 3-22   SAM Extended FORTRAN for SUBROUTINE XXM1
Figure 3-23   Comparison of Figures 3-21 and 3-22



ORIGINAL PAGE IS 
OF POOR QUALITY 
184100        SUBROUTINE STEP
184200        COMMON/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,DT,GAMMA,GAMI,SMU,FSMACH
184300       1,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
184400       2,RM,CNBR,PI,ITR,INVISC,LAMIN,NP,INT1,INT2,INT
184500        COMMON/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
184600        COMMON/READ/IREAD,IWRIT,NGRI
184700        COMMON/VIS/RE,PR,RMUE,RK
184800        COMMON/VARS/Q(720,6,30)
184900        COMMON/VAR0/S(720,5,30)
185000        COMMON/VAR1/X(720,30),Y(720,30),Z(720,30)
185100        COMMON/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
185200        LEVEL 2,Q,S,X,Y,Z
185300        COMMON/COUNT/NC,NC1
185400        COMMON/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
188300        RM = SMU
188400        C8 = 1.+2.*RM
188500        GAM2 = 2.-GAMMA
188600        DO 20 L = 2,LM
188700        DO 20 K = 2,KM
188800  C
188900  C***FILTRX
189000  C
189100        KL = (L-1)*ND+K
189200        JA = 2
189300        JB=JMAX-1
189400        CALL XXM(K,L,1,JMAX)
189500        DO 12 J=1,JMAX
189600        R1 =XX(J,1)*HDX
189700        R2 =XX(J,2)*HDX
189800        R3 =XX(J,3)*HDX
189900        R4 =XX(J,4)*HDX
190000  C
190100  C******AMATRX
190200  C
190300        RR= 1./Q(KL,1,J)
190400        U = Q(KL,2,J)*RR
190500        V = Q(KL,3,J)*RR
190600        W = Q(KL,4,J)*RR
190700        UU = U*R1+V*R2+W*R3
190800        UT = U**2+V**2+W**2
190900        C1 = GAMI*UT*.5
191000        C2 = Q(KL,5,J)*RR*GAMMA
191100        C3=C2-C1
191200        C4=R4+UU
191300        C5=GAMI*U
191400        C6=GAMI*V
191500        C7=GAMI*W
191600        D(J,1,1) = R4
191700        D(J,1,2) = R1
191800        D(J,1,3) = R2
191900        D(J,1,4) = R3
192000        D(J,1,5) = 0.
192100        D(J,2,1) = R1*C1-U*UU
192200        D(J,2,2) = C4+R1*GAM2*U
192300        D(J,2,3) = -R1*C6+R2*U
192400        D(J,2,4) = -R1*C7+R3*U
192500        D(J,2,5) = R1*GAMI
192600        D(J,3,1) = R2*C1-V*UU
192700        D(J,3,2) = R1*V-R2*C5
192800        D(J,3,3) = C4+R2*GAM2*V
192900        D(J,3,4) = -R2*C7+R3*V
193000        D(J,3,5) = R2*GAMI
193100        D(J,4,1) = R3*C1-W*UU
193200        D(J,4,2) = R1*W-R3*C5
193300        D(J,4,3) = R2*W-R3*C6
193400        D(J,4,4) = C4+R3*GAM2*W
193500        D(J,4,5) = R3*GAMI


Figure 3-8. Original Piece of Subroutine STEP 


193800        D(J,5,3) = R2*C3-C6*UU
193900        D(J,5,4) = R3*C3-C7*UU
194000        D(J,5,5) = R4+GAMMA*UU
194100  C
194200  C******END OF AMATRX
194300  C
194400     12 CONTINUE
194500        DO 25 J=JA,JB
194600        RJ = 1./Q(KL,6,J)
194700        RMJ=RM*RJ
194800        RR = RMJ*Q(KL,6,J-1)
194900        RF = RMJ*Q(KL,6,J+1)
195000        DO 23 N=1,5
195100        A(J,N,1) = -D(J-1,N,1)
195200        A(J,N,2) = -D(J-1,N,2)
195300        A(J,N,3) = -D(J-1,N,3)
195400        A(J,N,4) = -D(J-1,N,4)
195500        A(J,N,5) = -D(J-1,N,5)
195600        B(J,N,1) = 0.0
195700        B(J,N,2) = 0.0
195800        B(J,N,3) = 0.0
195900        B(J,N,4) = 0.0
196000        B(J,N,5) = 0.0
196100        C(J,N,1) = D(J+1,N,1)
196200        C(J,N,2) = D(J+1,N,2)
196300        C(J,N,3) = D(J+1,N,3)
196400        C(J,N,4) = D(J+1,N,4)
196500        C(J,N,5) = D(J+1,N,5)
196600        A(J,N,N) = A(J,N,N)-RR
196700        B(J,N,N) = C8
196800        C(J,N,N) = C(J,N,N)-RF
196900     23 F(J,N)=S(KL,N,J)
197000     25 CONTINUE
197100  C
197200  C*****END OF FILTRX
197300  C
197400  C
197500  C     S MUST BE ZERO ON B.C.
197600  C
197700        CALL BTRI(2,JM)
197800        DO 21 J = 2,JM
197900        S(KL,1,J) = F(J,1)
198000        S(KL,2,J) = F(J,2)
198100        S(KL,3,J) = F(J,3)
198200        S(KL,4,J) = F(J,4)
198300     21 S(KL,5,J) = F(J,5)
198400     20 CONTINUE
218600        RETURN
218700        END


Figure 3-8. Original Piece of Subroutine STEP (Cont)
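
The DO 25 / DO 23 loops of Figure 3-8 fill one block row of a block-tridiagonal system for each interior J before BTRI is called. The following is a minimal Python sketch of that assembly step only, not the report's code: the function name `assemble_block_row` and the list-of-lists representation are assumptions made for illustration, and the 5x5 blocks D(J-1) and D(J+1) are passed in rather than computed.

```python
def assemble_block_row(d_prev, d_next, rr, rf, c8):
    """Build the A, B, C blocks for one J, following lines 195100-196800.

    d_prev, d_next: square blocks standing in for D(J-1,.,.) and D(J+1,.,.).
    rr, rf, c8:     the scalars RR, RF and C8 of the listing.
    (Hypothetical helper; the names are not from the report.)
    """
    n = len(d_prev)
    a = [[-d_prev[i][k] for k in range(n)] for i in range(n)]  # A = -D(J-1)
    b = [[0.0 for _ in range(n)] for _ in range(n)]            # B = 0
    c = [[d_next[i][k] for k in range(n)] for i in range(n)]   # C = D(J+1)
    for i in range(n):
        a[i][i] -= rr      # A(J,N,N) = A(J,N,N)-RR
        b[i][i] = c8       # B(J,N,N) = C8
        c[i][i] -= rf      # C(J,N,N) = C(J,N,N)-RF
    return a, b, c


# Small 2x2 stand-in blocks keep the example readable; STEP uses 5x5.
a, b, c = assemble_block_row([[1.0, 2.0], [3.0, 4.0]],
                             [[5.0, 6.0], [7.0, 8.0]],
                             rr=0.5, rf=0.25, c8=3.0)
```

Only the diagonal entries receive the RR, C8 and RF terms; every off-diagonal entry of B stays zero, which is why the listing can set B(J,N,1..5) to 0.0 and then overwrite B(J,N,N).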





184100        SUBROUTINE STEP
184200        GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
184300       1,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
184400       2,RM,CNBR,PI,INVISC,LAMIN,NP
184500        GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
184600        GLOBAL/READ/IREAD,IWRIT,NGRI
184700        GLOBAL/VIS/RE,PR,RMUE,RK
184800        EXTENDED/VARS/Q(720,30,6)
184900        EXTENDED/VAR0/S(720,30,5)
185000        EXTENDED/VAR1/X(720,30),Y(720,30),Z(720,30)
185100        LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
185200        LEVEL 2,Q,S,X,Y,Z
185300        CONTROL/COUNT/NC,NC1,DT
185400        LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
188300        RM = SMU
188400        C8 = 1.+2.*RM
188500        GAM2 = 2.-GAMMA
188600        DOALL,K=2,KM;L=2,LM
188800  C
188900  C***FILTRX
189000  C
189100        KL = (L-1)*ND+K
189400        INCLUDE XXM1(K,L,1,JMAX)
189500        DO 12 J=1,JMAX
189501        Q1=Q(KL,J,1)
189502        Q2=Q(KL,J,2)
189503        Q3=Q(KL,J,3)
189504        Q4=Q(KL,J,4)
189505        Q5=Q(KL,J,5)
189600        R1 =XX(J,1)*HDX
189700        R2 =XX(J,2)*HDX
189800        R3 =XX(J,3)*HDX
189900        R4 =XX(J,4)*HDX
190000  C
190100  C******AMATRX
190200  C
190300        RR= 1./Q1
190400        U = Q2*RR
190500        V = Q3*RR
190600        W = Q4*RR
190700        UU = U*R1+V*R2+W*R3
190800        UT = U**2+V**2+W**2
190900        C1 = GAMI*UT*.5
191000        C2 = Q5*RR*GAMMA
191100        C3=C2-C1
191200        C4=R4+UU
191300        C5=GAMI*U
191400        C6=GAMI*V
191500        C7=GAMI*W
191600        D(J,1,1) = R4
191700        D(J,1,2) = R1
191800        D(J,1,3) = R2
191900        D(J,1,4) = R3
192000        D(J,1,5) = 0.
192100        D(J,2,1) = R1*C1-U*UU
192200        D(J,2,2) = C4+R1*GAM2*U
192300        D(J,2,3) = -R1*C6+R2*U
192400        D(J,2,4) = -R1*C7+R3*U
192500        D(J,2,5) = R1*GAMI
192600        D(J,3,1) = R2*C1-V*UU
192700        D(J,3,2) = R1*V-R2*C5
192800        D(J,3,3) = C4+R2*GAM2*V
192900        D(J,3,4) = -R2*C7+R3*V
193000        D(J,3,5) = R2*GAMI
193100        D(J,4,1) = R3*C1-W*UU

Figure 3-9. Identical Piece of Subroutine STEP in SAM 
Extended FORTRAN 
193200        D(J,4,2) = R1*W-R3*C5
193300        D(J,4,3) = R2*W-R3*C6
193400        D(J,4,4) = C4+R3*GAM2*W
193500        D(J,4,5) = R3*GAMI
193600        D(J,5,1) = (-C2+2.*C1)*UU
193700        D(J,5,2) = R1*C3-C5*UU
193800        D(J,5,3) = R2*C3-C6*UU
193900        D(J,5,4) = R3*C3-C7*UU
194000        D(J,5,5) = R4+GAMMA*UU
194100  C
194200  C******END OF AMATRX
194300  C
194400     12 CONTINUE
194500        DO 25 J=2,JMAX-1
194501        IF (J.GT.2) GO TO 777
194502        Q6=Q(KL,J,6)
194503        Q6M=Q(KL,J-1,6)
194504        GO TO 778
194505    777 Q6M = RX
194506        Q6= RY
194510    778 Q6P=Q(KL,J+1,6)
194511        RX = Q6
194512        RY = Q6P
194600        RJ = 1./Q6
194700        RMJ=RM*RJ
194800        RR = RMJ*Q6M
194900        RF = RMJ*Q6P
194901        S1 = S(KL,J,1)
194902        S2 = S(KL,J,2)
194903        S3 = S(KL,J,3)
194904        S4 = S(KL,J,4)
194905        S5 = S(KL,J,5)
195000        DO 23 N=1,5
195100        A(J,N,1) = -D(J-1,N,1)
195200        A(J,N,2) = -D(J-1,N,2)
195300        A(J,N,3) = -D(J-1,N,3)
195400        A(J,N,4) = -D(J-1,N,4)
195500        A(J,N,5) = -D(J-1,N,5)
195600        B(J,N,1) = 0.0
195700        B(J,N,2) = 0.0
195800        B(J,N,3) = 0.0
195900        B(J,N,4) = 0.0
196000        B(J,N,5) = 0.0
196100        C(J,N,1) = D(J+1,N,1)
196200        C(J,N,2) = D(J+1,N,2)
196300        C(J,N,3) = D(J+1,N,3)
196400        C(J,N,4) = D(J+1,N,4)
196500        C(J,N,5) = D(J+1,N,5)
196600        A(J,N,N) = A(J,N,N)-RR
196700        B(J,N,N) = C8
196800        C(J,N,N) = C(J,N,N)-RF
196900     23 CONTINUE
196901        F(J,1) = S1
196902        F(J,2) = S2
196903        F(J,3) = S3
196904        F(J,4) = S4
196905        F(J,5) = S5
197000     25 CONTINUE
197100  C
197200  C*****END OF FILTRX
197300  C
197400  C
197500  C     S MUST BE ZERO ON B.C.
197600  C
197700        CALL BTRI(2,JM)
197800        DO 21 J = 2,JM
197900        S1 = F(J,1)
198000        S2 = F(J,2)
198100        S3 = F(J,3)
198200        S4 = F(J,4)
198300        S5 = F(J,5)
198301        S(KL,J,1) = S1
198302        S(KL,J,2) = S2
198303        S(KL,J,3) = S3
198304        S(KL,J,4) = S4
198305        S(KL,J,5) = S5
198306     21 CONTINUE
198400        ENDDO;ENDDO
218600        RETURN
218700        END


Figure 3-9. Identical Piece of Subroutine STEP in SAM 
Extended FORTRAN (Cont) 
3.2.3.1 SAM Extended FORTRAN for SUBROUTINE STEP (LOOP DO 20)

Figure 3-10 shows the changes made in the original NASA-AMES
program to produce SAM Extended FORTRAN. As can be seen, the
greatest number of changes occur in the declarations. Only the
named COMMON blocks VARS, VAR0, and VAR1 need to be put in
Extended Memory. Note that, for simplicity in accessing, the last
two extents of the S and Q arrays were interchanged.
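
One way to see the effect of interchanging the last two extents is to compare how far apart successive J elements lie in each layout. The sketch below assumes dense column-major FORTRAN storage with no padding, which the report does not state; the helper names `column_major_offset` and `j_stride` are hypothetical.

```python
def column_major_offset(index, extents):
    """Zero-based word offset of a 1-based FORTRAN subscript tuple."""
    offset, stride = 0, 1
    for i, n in zip(index, extents):
        offset += (i - 1) * stride
        stride *= n
    return offset


ORIGINAL = (720, 6, 30)      # Q(KL, component, J) in the NASA-AMES code
INTERCHANGED = (720, 30, 6)  # Q(KL, J, component) in SAM Extended FORTRAN


def j_stride(extents, j_axis):
    """Words between two elements whose J subscripts differ by one."""
    lo = [1, 1, 1]
    hi = [1, 1, 1]
    hi[j_axis] = 2
    return column_major_offset(hi, extents) - column_major_offset(lo, extents)


print(j_stride(ORIGINAL, 2))      # 4320
print(j_stride(INTERCHANGED, 1))  # 720
```

Under this assumption the interchanged layout steps through J in units of 720 words, which agrees with the JJ = (J-1)*720 term used in the address calculations of Figure 3-13.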

The Named Common Blocks VAR3 and BTRID are put in LOCAL Memory. 

It should be noted that in another portion of the program,
SUBROUTINE METOUT, the arrays XX, YY, and ZZ are written out after
the subroutine calls. This would not be permitted, and an
additional copy to Extended Memory arrays, say XX1, YY1, and ZZ1,
would be needed. Also, the P array is used in a variety of ways,
including an EQUIVALENCE statement, in other portions of the code.
However, in this specific portion of the code the P array is not
accessed in any way, and so for convenience it was left in LOCAL
for the example. Copies of all data in GLOBAL memory are assumed
to be in Processor Memory.

The only other changes to the program were the replacement of the
DO 20 loops with the two-dimensional DOALL loop (and ENDDO
statement) and the replacement of the CALL statement in line
189400 with an INCLUDE, since the PROCEDURE XXM1 has Extended
Memory references. (Further discussion of this will be supplied later.)
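
The two-dimensional DOALL can be pictured as a set of independent (K, L) iterations, each of which owns one value of the combined index KL = (L-1)*ND+K. The following Python sketch enumerates that index set; it is an illustration only, the generator name `doall_points` is hypothetical, and the small values of KM, LM and ND are chosen for readability rather than taken from the report.

```python
def doall_points(km, lm, nd):
    """Yield (K, L, KL) for every iteration of DOALL,K=2,KM;L=2,LM."""
    for l in range(2, lm + 1):
        for k in range(2, km + 1):
            yield k, l, (l - 1) * nd + k   # KL = (L-1)*ND+K


points = list(doall_points(km=4, lm=3, nd=5))
kl_values = [kl for _, _, kl in points]
print(kl_values)  # [7, 8, 9, 12, 13, 14]
```

Because every iteration has a distinct KL, the iterations do not share elements of Q or S and may be executed in any order, which is the property the DOALL statement asserts.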


 1  R  184200        GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
 2  R  184400       2,RM,CNBR,PI,INVISC,LAMIN,NP
 3  R  184500        GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
 4  R  184600        GLOBAL/READ/IREAD,IWRIT,NGRI
 5  R  184700        GLOBAL/VIS/RE,PR,RMUE,RK
 6  R  184800        EXTENDED/VARS/Q(720,30,6)
 7  R  184900        EXTENDED/VAR0/S(720,30,5)
 8  R  185000        EXTENDED/VAR1/X(720,30),Y(720,30),Z(720,30)
 9  R  185100        LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
10  R  185300        CONTROL/COUNT/NC,NC1,DT
11  R  185400        LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
12  R  188600        DOALL,K=2,KM;L=2,LM
13  R  188700
14  R  189400        INCLUDE XXM1(K,L,1,JMAX)
15  R  198400        ENDDO;ENDDO


Figure 3-10. Comparison of Original and SAM Extended 
FORTRAN - Subroutine STEP 
3.2.3.2 Preliminary Code Analysis and Code Reorganization for STEP

Figure 3-11 shows the preliminary code reorganization that would be
performed by the compiler. The DO loop limits in line 194500
have been modified so that the statement now reads DO 25 J=2,JMAX-1.
This was done so that the initial and terminal values are composed of
literals or Global variables that exist in both Processor
and Central Memory.

The code only accesses the arrays Q and S from Extended Memory.

The accessing of the Q array is shown in lines 189501-189505 and
in lines 194501-194510. The notation for this data movement from
Extended Memory to Processor Memory is the FORTRAN statement
Q1=Q(KL,J,1). (This notation is used for clarity and is not
meant to be an implied ASSIGN statement.) The fetch of
Q(KL,J-1,6) is only necessary for J=2, since for the other values
of J the element already exists in Processor Memory; hence the IF
test and branch at line 194501. Since the DO 25 loop exists in both
the Processor and Control Unit code, the execution pattern is:

1. Set J=2

2. Synch for fetch Q(KL, 2, 6)

3. Synch for fetch Q(KL, 1, 6)

4. Synch for fetch Q(KL, 3, 6)

5. Set J=3

6. Synch for fetch Q(KL, 4, 6) (2 and 3 already in Processor
   Memory)

7. Set J=4

8. Synch for fetch Q(KL, 5, 6) (3 and 4 already in Processor
   Memory)
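
The execution pattern above amounts to a three-element sliding window over J in which only the leading element is fetched after the first pass. The following Python sketch mirrors lines 194501-194512 of the reorganized code under that reading; the function name `fetch_schedule` is hypothetical, and each "fetch" simply records the J index requested instead of touching any memory.

```python
def fetch_schedule(jmax):
    """Return (fetch order, windows) for DO 25 J=2,JMAX-1."""
    fetched = []

    def fetch(j):
        # Stand-in for a synchronized fetch of Q(KL, j, 6).
        fetched.append(j)
        return j

    rx = ry = None
    windows = []
    for j in range(2, jmax):          # J = 2 .. JMAX-1
        if j > 2:                     # label 777: reuse saved values
            q6m, q6 = rx, ry
        else:                         # first pass: fetch J and J-1
            q6 = fetch(j)
            q6m = fetch(j - 1)
        q6p = fetch(j + 1)            # label 778: always fetch J+1
        rx, ry = q6, q6p              # save for the next iteration
        windows.append((q6m, q6, q6p))
    return fetched, windows


order, windows = fetch_schedule(6)
print(order)  # [2, 1, 3, 4, 5, 6]
```

Each element of the sixth component of Q is fetched once, whereas a direct translation of the original loop would request three elements on every pass.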
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IHPLICIT/STEPPIECENSSI tl2/Z2/7r) 


184100        SUBROUTINE STEP
184200        GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
184300       1,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
184400       2,RM,CNBR,PI,INVISC,LAMIN,NP
184500        GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
184600        GLOBAL/READ/IREAD,IWRIT,NGRI
184700        GLOBAL/VIS/RE,PR,RMUE,RK
184800        EXTENDED/VARS/Q(720,30,6)
184900        EXTENDED/VAR0/S(720,30,5)
185000        EXTENDED/VAR1/X(720,30),Y(720,30),Z(720,30)
185100        LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
185200        LEVEL 2,Q,S,X,Y,Z
185300        CONTROL/COUNT/NC,NC1,DT
185400        LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
188300        RM = SMU
188400        C8 = 1.+2.*RM
188500        GAM2 = 2.-GAMMA
188600        DOALL,K=2,KM;L=2,LM
188800  C
188900  C***FILTRX
189000  C
189100        KL = (L-1)*ND+K
189400        INCLUDE XXM1(K,L,1,JMAX)
189500        DO 12 J=1,JMAX
189501        Q1=Q(KL,J,1)
189502        Q2=Q(KL,J,2)
189503        Q3=Q(KL,J,3)
189504        Q4=Q(KL,J,4)
189505        Q5=Q(KL,J,5)
189600        R1 =XX(J,1)*HDX
189700        R2 =XX(J,2)*HDX
189800        R3 =XX(J,3)*HDX
189900        R4 =XX(J,4)*HDX
190000  C
190100  C******AMATRX
190200  C
190300        RR= 1./Q1
190400        U = Q2*RR
190500        V = Q3*RR
190600        W = Q4*RR
190700        UU = U*R1+V*R2+W*R3
190800        UT = U**2+V**2+W**2
190900        C1 = GAMI*UT*.5
191000        C2 = Q5*RR*GAMMA
191100        C3=C2-C1
191200        C4=R4+UU
191300        C5=GAMI*U
191400        C6=GAMI*V
191500        C7=GAMI*W
191600        D(J,1,1) = R4
191700        D(J,1,2) = R1
191800        D(J,1,3) = R2
191900        D(J,1,4) = R3
192000        D(J,1,5) = 0.
192100        D(J,2,1) = R1*C1-U*UU
192200        D(J,2,2) = C4+R1*GAM2*U
192300        D(J,2,3) = -R1*C6+R2*U
192400        D(J,2,4) = -R1*C7+R3*U
192500        D(J,2,5) = R1*GAMI
192600        D(J,3,1) = R2*C1-V*UU
192700        D(J,3,2) = R1*V-R2*C5
192800        D(J,3,3) = C4+R2*GAM2*V
192900        D(J,3,4) = -R2*C7+R3*V
193000        D(J,3,5) = R2*GAMI
193100        D(J,4,1) = R3*C1-W*UU
193200        D(J,4,2) = R1*W-R3*C5
193300        D(J,4,3) = R2*W-R3*C6
193400        D(J,4,4) = C4+R3*GAM2*W
193500        D(J,4,5) = R3*GAMI
193600        D(J,5,1) = (-C2+2.*C1)*UU
193700        D(J,5,2) = R1*C3-C5*UU
193800        D(J,5,3) = R2*C3-C6*UU
193900        D(J,5,4) = R3*C3-C7*UU


Figure 3-11. Preliminary Compiler Code Reorganization
Subroutine STEP
194000        D(J,5,5) = R4+GAMMA*UU
194100  C
194200  C******END OF AMATRX
194300  C
194400     12 CONTINUE
194500        DO 25 J=2,JMAX-1
194501        IF (J.GT.2) GO TO 777
194502        Q6=Q(KL,J,6)
194503        Q6M=Q(KL,J-1,6)
194504        GO TO 778
194505    777 Q6M = RX
194506        Q6= RY
194510    778 Q6P=Q(KL,J+1,6)
194511        RX = Q6
194512        RY = Q6P
194600        RJ = 1./Q6
194700        RMJ=RM*RJ
194800        RR = RMJ*Q6M
194900        RF = RMJ*Q6P
194901        S1 = S(KL,J,1)
194902        S2 = S(KL,J,2)
194903        S3 = S(KL,J,3)
194904        S4 = S(KL,J,4)
194905        S5 = S(KL,J,5)
195000        DO 23 N=1,5
195100        A(J,N,1) = -D(J-1,N,1)
195200        A(J,N,2) = -D(J-1,N,2)
195300        A(J,N,3) = -D(J-1,N,3)
195400        A(J,N,4) = -D(J-1,N,4)
195500        A(J,N,5) = -D(J-1,N,5)
195600        B(J,N,1) = 0.0
195700        B(J,N,2) = 0.0
195800        B(J,N,3) = 0.0
195900        B(J,N,4) = 0.0
196000        B(J,N,5) = 0.0
196100        C(J,N,1) = D(J+1,N,1)
196200        C(J,N,2) = D(J+1,N,2)
196300        C(J,N,3) = D(J+1,N,3)
196400        C(J,N,4) = D(J+1,N,4)
196500        C(J,N,5) = D(J+1,N,5)
196600        A(J,N,N) = A(J,N,N)-RR
196700        B(J,N,N) = C8
196800        C(J,N,N) = C(J,N,N)-RF
196900     23 CONTINUE
196901        F(J,1) = S1
196902        F(J,2) = S2
196903        F(J,3) = S3
196904        F(J,4) = S4
196905        F(J,5) = S5
197000     25 CONTINUE
197100  C
197200  C*****END OF FILTRX
197300  C
197400  C
197500  C     S MUST BE ZERO ON B.C.
197600  C
197700        CALL BTRI(2,JM)
197800        DO 21 J = 2,JM
197900        S1 = F(J,1)
198000        S2 = F(J,2)
198100        S3 = F(J,3)
198200        S4 = F(J,4)
198300        S5 = F(J,5)
198301        S(KL,J,1) = S1
198302        S(KL,J,2) = S2
198303        S(KL,J,3) = S3
198304        S(KL,J,4) = S4
198305        S(KL,J,5) = S5
198400        ENDDO;ENDDO
218600        RETURN
218700        END


Figure 3-11. Preliminary Compiler Code Reorganization
Subroutine STEP (Cont)
In SUBROUTINE TURBDA, branches on the DOALL variable were
demonstrated. This example demonstrates the capability of branching
around fetches on the basis of inner nested DO loop variables.

Finally, the fetching and storing of the array S is shown in lines
194901-194905 and 198301-198305. Because of the notation chosen,
i.e., Si = S(KL, J, i), the statements were removed from the DO
loop (23) on N. This is not a requirement. An array, say SS, with
subscripts could have been declared with a simple DIMENSION
statement.

Figure 3-12 shows the lines of code that have been replaced (R),
inserted (I), or deleted (-).

3.2.3.3 Programmatic Transformations by the Compiler and
Transposition Network Calculations for STEP Portion

Figures 3-13 and 3-14 show explicitly the address calculations for
setting the Transposition Network Offset (3-13) and the Memory
Module address (3-14) for each access to Extended Memory.
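
The two figures apply MOD(address, 521) in the Control Unit code and address/521 in the Processor code, so each Extended Memory word address is split into a remainder and a quotient. The Python sketch below illustrates that split; the function names are hypothetical, and reading the remainder as the Transposition Network Offset and the quotient as the Memory Module address follows the figure descriptions rather than any hardware detail given here.

```python
MODULUS = 521  # constant used by MOD(...,521) and .../521 in the listings


def split_address(addr):
    """Return (transposition offset, memory module address) for addr."""
    return addr % MODULUS, addr // MODULUS


def offsets_for_j_sweep(base, count, stride=720):
    """Offsets seen when J advances by one, i.e. the address by 720 words."""
    return [split_address(base + j * stride)[0] for j in range(count)]


print(split_address(1047))                   # (5, 2)
print(len(set(offsets_for_j_sweep(0, 521)))) # 521
```

Since 521 is prime and 720 is not a multiple of it, 521 consecutive steps in J land on 521 different offsets. That observation is an arithmetic property of the two constants, not a claim made in the text.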

Considering the Control Unit Code first, on a line-by-line basis:

188600            Hidden loop N has 2 cycles.

188601            Calculation of the number of PEs used up to that cycle.

188601            Address of Q(IW+1,1,1) in memory, which is in PE#=0;
                  i.e., on cycle 1 the address of Q(1,1,1) is equal to
                  the base address of Q in memory. On cycle 2 the
                  address of Q(513,1,1) is the base address of Q plus
                  512.

188602            Address of Q(IW+1,1,2) is 42,600 greater than
                  Q(IW+1,1,1).

188603 - 188639   Similar other calculations for S and Q.
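
The line-by-line description above can be restated as a small calculation. The Python sketch below uses the two constants quoted in the text (512 PEs per cycle and an increment of 42,600 between successive components); the function names, the base address of 1000, and the reading of 720 as the number of points to be covered are assumptions made for illustration.

```python
PES_PER_CYCLE = 512       # IVV = 512*N-512 in the listing
COMPONENT_STRIDE = 42600  # increment between Q(..,1) and Q(..,2), as quoted


def hidden_cycles(n_points, n_pes=PES_PER_CYCLE):
    """Cycles of the hidden loop needed to cover n_points with n_pes PEs."""
    return (n_points + n_pes - 1) // n_pes


def cycle_addresses(base, cycle, n_components):
    """Addresses for PE 0 on a given cycle (1-based), one per component."""
    ivv = PES_PER_CYCLE * cycle - PES_PER_CYCLE   # points used so far
    first = base + ivv                            # e.g. IAOQ1 = IBSQ1 + IVV
    return [first + m * COMPONENT_STRIDE for m in range(n_components)]


print(hidden_cycles(720))            # 2
print(cycle_addresses(1000, 1, 3))   # [1000, 43600, 86200]
print(cycle_addresses(1000, 2, 2))   # [1512, 44112]
```

On cycle 1 the first address equals the base address, and on cycle 2 it is the base address plus 512, matching the description of line 188601.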
 1  R  189200
 2  R  189300
 3  I  189501        Q1=Q(KL,J,1)
 4  I  189502        Q2=Q(KL,J,2)
 5  I  189503        Q3=Q(KL,J,3)
 6  I  189504        Q4=Q(KL,J,4)
 7  I  189505        Q5=Q(KL,J,5)
 8  R  190300        RR= 1./Q1
 9  R  190400        U = Q2*RR
10  R  190500        V = Q3*RR
11  R  190600        W = Q4*RR
12  R  191000        C2 = Q5*RR*GAMMA
13  R  194500        DO 25 J=2,JMAX-1
14  I  194501        IF (J.GT.2) GO TO 777
15  I  194502        Q6=Q(KL,J,6)
16  I  194503        Q6M=Q(KL,J-1,6)
17  I  194504        GO TO 778
18  I  194505    777 Q6M = RX
19  I  194506        Q6= RY
20  I  194510    778 Q6P=Q(KL,J+1,6)
21  I  194511        RX = Q6
22  I  194512        RY = Q6P
23  R  194600        RJ = 1./Q6
24  R  194800        RR = RMJ*Q6M
25  R  194900        RF = RMJ*Q6P
26  I  194901        S1 = S(KL,J,1)
27  I  194902        S2 = S(KL,J,2)
28  I  194903        S3 = S(KL,J,3)
29  I  194904        S4 = S(KL,J,4)
30  I  194905        S5 = S(KL,J,5)
31  R  196900     23 CONTINUE
32  I  196901        F(J,1) = S1
33  I  196902        F(J,2) = S2
34  I  196903        F(J,3) = S3
35  I  196904        F(J,4) = S4
36  I  196905        F(J,5) = S5
37  R  197900        S1 = F(J,1)
38  R  198000        S2 = F(J,2)
39  R  198100        S3 = F(J,3)
40  R  198200        S4 = F(J,4)
41  R  198300        S5 = F(J,5)
42  I  198301        S(KL,J,1) = S1
43  I  198302        S(KL,J,2) = S2
44  I  198303        S(KL,J,3) = S3
45  I  198304        S(KL,J,4) = S4
46  I  198305        S(KL,J,5) = S5
47  I  198306     21 CONTINUE


Figure 3-12. Comparison of SAM Extended FORTRAN and Compiler 
Reorganized Code •" Subroutine STEP 
        C     THE COMPILER WILL HAVE DETERMINED THE NUMBER OF CYCLES
        C     OF THE HIDDEN LOOP
        C
        C     IE. NHIDDEN = (720/ND*LMAX+N-1)/N = 2 CYCLES
        C     ALSO THAT ISKIP=1
184100        SUBROUTINE STEP
184200        GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
184300       1,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
184400       2,RM,CNBR,PI,INVISC,LAMIN,NP
184500        GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
184600        GLOBAL/READ/IREAD,IWRIT,NGRI
184700        GLOBAL/VIS/RE,PR,RMUE,RK
184800        EXTENDED/VARS/Q(720,30,6)
184900        EXTENDED/VAR0/S(720,30,5)
185000        EXTENDED/VAR1/X(720,30),Y(720,30),Z(720,30)
185100        LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
185200        LEVEL 2,Q,S,X,Y,Z
185300        CONTROL/COUNT/NC,NC1,DT
185400        LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
188600        DO 20 N = 1,2
188601        IVV=512*N-512
188630        IAOQ1=IBSQ1 + IVV
188631        IAOQ2=IAOQ1 + 42600
188632        IAOQ3=IAOQ2 + 42600
188633        IAOQ4=IAOQ3 + 42600
188634        IAOQ6=IAOQ5 + 42600
188635        IAOS1=IBSS1 + IVV
188636        IAOS2=IAOS1 + 42600
188637        IAOS3=IAOS2 + 42600
188638        IAOS4=IAOS3 + 42600
188639        IAOS5=IAOS4 + 42600
188640        IAOXM = IBSX +IVV-1
188641        IAOXP = IBSX +IVV+1
188642        IAOXMN= IBSX +IVV-ND
188643        IAOXPN= IBSX +IVV+ND
188644        IAOYM = IBSY +IVV-1
188645        IAOYP = IBSY +IVV+1
188646        IAOYMN= IBSY +IVV-ND
188647        IAOYPN= IBSY +IVV+ND
188648        IAOZM = IBSZ +IVV-1
188649        IAOZP = IBSZ +IVV+1
188650        IAOZMN= IBSZ +IVV-ND
188651        IAOZPN= IBSZ +IVV+ND
188800  C
188900  C***FILTRX
189000  C
189100        KL = (L-1)*ND + K
              DO 10 J = 1,JMAX
              JJ= (J-1)*720
              IFSETQ6 = MOD((IAOQ6+JJ),521)
        SYNCH IFSETXM = MOD((IAOXM+JJ),521)
        SYNCH IFSETXP = MOD((IAOXP+JJ),521)
        SYNCH IFSETXMN= MOD((IAOXMN+JJ),521)
        SYNCH IFSETXPN= MOD((IAOXPN+JJ),521)


Figure 3-13. Control Unit Code for SAM - Subroutine STEP
(META Assembler)
        SYNCH IFSETYM = MOD((IAOYM+JJ),521)
        SYNCH IFSETYP = MOD((IAOYP+JJ),521)
        SYNCH IFSETYMN= MOD((IAOYMN+JJ),521)
        SYNCH IFSETYPN= MOD((IAOYPN+JJ),521)
        SYNCH IFSETZM = MOD((IAOZM+JJ),521)
        SYNCH IFSETZP = MOD((IAOZP+JJ),521)
        SYNCH IFSETZMN= MOD((IAOZMN+JJ),521)
        SYNCH IFSETZPN= MOD((IAOZPN+JJ),521)
           10 CONTINUE
              DO 12 J=1,JMAX
              JJ=(J-1)*720
              IFSETQ1 = MOD((IAOQ1+JJ),521)
        SYNCH IFSETQ2 = MOD((IAOQ2+JJ),521)
        SYNCH IFSETQ3 = MOD((IAOQ3+JJ),521)
        SYNCH IFSETQ4 = MOD((IAOQ4+JJ),521)
        SYNCH IFSETQ5 = MOD((IAOQ5+JJ),521)


Figure 3-13. Control Unit Code for SAM - Subroutine STEP
(META Assembler) (Cont)
           12 CONTINUE
              DO 25 J=2,JMAX-1
              JJ= (J-1)*720
              IF(J.GT.2) GO TO 777
              IFSETQ6 = MOD((IAOQ6+JJ),521)
        SYNCH IFSETQ6M = MOD((IAOQ6+JJ-720),521)
        SYNCH GO TO 778
          777 CONTINUE
          778 IFSETQ6P = MOD((IAOQ6+JJ+720),521)
        SYNCH IFSETS1 = MOD((IAOS1+JJ),521)
        SYNCH IFSETS2 = MOD((IAOS2+JJ),521)
        SYNCH IFSETS3 = MOD((IAOS3+JJ),521)
        SYNCH IFSETS4 = MOD((IAOS4+JJ),521)
        SYNCH IFSETS5 = MOD((IAOS5+JJ),521)
           25 CONTINUE
        C
        C*****END OF FILTRX
        C
        C
        C     S MUST BE ZERO ON B.C.
        C
              CALL BTRI(2,JM)
              DO 21 J=2,JM
              JJ= (J-1)*720
              IFSETS1 = MOD((IAOS1+JJ),521)
        SYNCH IFSETS2 = MOD((IAOS2+JJ),521)
        SYNCH IFSETS3 = MOD((IAOS3+JJ),521)
        SYNCH IFSETS4 = MOD((IAOS4+JJ),521)
        SYNCH IFSETS5 = MOD((IAOS5+JJ),521)
           21 CONTINUE
           20 CONTINUE
              RETURN
              END


Figure 3-13. Control Unit Code for SAM - Subroutine STEP
(META Assembler) (Cont)



        C     THE COMPILER WILL HAVE DETERMINED THE NUMBER OF CYCLES
        C     OF THE HIDDEN LOOP
        C
        C     IE. NHIDDEN = (720/ND*LMAX+N-1)/N = 2 CYCLES
        C     ALSO THAT ISKIP=1
        C
        C     *******
        C
        C     ******* NOTE ALL SYNCHS IN THE PE CODE ARE WITH THE
        C     PROVISO THAT THEY DO NOT FETCH FOR K.LT.2 OR
        C     K.GT.KM OR L.LT.2 OR L.GT.LM
        C
        C     IE SYNCH SHOULD BE REPLACED WITH AN EXPRESSION
        C
        C     SYNCH WITH MODE=0 FOR K.LT.2,K.GT.KM,L.LT.2,L.GT.LM
        C     NOTE THE IF BRANCHES IN EACH DO LOOP AFTER THE
        C     SYNCH CODE
184100        SUBROUTINE STEP
184200        GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
184300       1,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
184400       2,RM,CNBR,PI,INVISC,LAMIN,NP
184500        GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
184600        GLOBAL/READ/IREAD,IWRIT,NGRI
184700        GLOBAL/VIS/RE,PR,RMUE,RK
184800        EXTENDED/VARS/Q(720,30,6)
184900        EXTENDED/VAR0/S(720,30,5)
185000        EXTENDED/VAR1/X(720,30),Y(720,30),Z(720,30)
185100        LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
185200        LEVEL 2,Q,S,X,Y,Z
185300        CONTROL/COUNT/NC,NC1,DT
185400        LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)
188300        RM = SMU
188400        C8 = 1.+2.*RM
188500        GAM2 = 2.-GAMMA
188600        DO 20 N=1,2
188601        IVV=512*N-512
188602        IV = IVV + IPEND
188603        LM1 = IV/ND
188604        L = LM1 + 1
188605        KM1 = IV-LM1*ND
188606        K = KM1 + 1
188607        IAOQ1=IBSQ1 + IV
188608        IAOQ2=IAOQ1 + 42600
188609        IAOQ3=IAOQ2 + 42600
188610        IAOQ4=IAOQ3 + 42600
188611        IAOQ6=IAOQ5 + 42600
188612        IAOS1=IBSS1 + IV
188613        IAOS2=IAOS1 + 42600
188614        IAOS3=IAOS2 + 42600
188615        IAOS4=IAOS3 + 42600
188616        IAOS5=IAOS4 + 42600
188617        IAOXM = IBSX +IV-1
188618        IAOXP = IBSX +IV+1
188619        IAOXMN= IBSX +IV-ND
188620        IAOXPN= IBSX +IV+ND
188621        IAOYM = IBSY +IV-1
188622        IAOYP = IBSY +IV+1
188623        IAOYMN= IBSY +IV-ND
188624        IAOYPN= IBSY +IV+ND
188625        IAOZM = IBSZ +IV-1
188626        IAOZP = IBSZ +IV+1
188627        IAOZMN= IBSZ +IV-ND
188628        IAOZPN= IBSZ +IV+ND
188800  C
188900  C***FILTRX
189000  C
189100        KL = (L-1)*ND+K



Figure 3-14. Processor Code for SAM - Subroutine STEP 

(META Assembler) 
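
Lines 188601-188606 of the Processor code let each processing element derive its own K and L from the cycle number and its PE number, and the header comments state that fetches are suppressed when K or L falls outside the DOALL range. The Python sketch below follows that arithmetic; the function names are hypothetical, and the values ND=24, KM=23 and LM=29 are chosen only so the example runs, not taken from the report.

```python
def pe_indices(cycle, pe_number, nd):
    """Return (K, L) for one PE on one hidden-loop cycle (cycle is 1-based)."""
    ivv = 512 * cycle - 512      # IVV = 512*N-512
    iv = ivv + pe_number         # IV = IVV + IPEND
    lm1 = iv // nd               # LM1 = IV/ND (integer division)
    l = lm1 + 1                  # L = LM1 + 1
    k = iv - lm1 * nd + 1        # K = KM1 + 1, with KM1 = IV-LM1*ND
    return k, l


def is_active(k, l, km, lm):
    """True when the PE lies inside DOALL,K=2,KM;L=2,LM."""
    return 2 <= k <= km and 2 <= l <= lm


print(pe_indices(1, 0, nd=24))   # (1, 1)
print(pe_indices(2, 0, nd=24))   # (9, 22)
print(is_active(1, 1, km=23, lm=29))   # False
```

A PE for which `is_active` is false corresponds to the "SYNCH WITH MODE=0" case in the listing header: it takes part in the synchronization but does not fetch.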
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\l\ 

1924 00 
19 2500 
192600 
192^00 
192900 
19 29 00 
19 3000 
19 3103 
19 3200 
19 3300 
19 340C 
19 3500 
19 3600 
19 37 0 0 
19 3900 
19 3900 
194000 
19 4100 
-194200 
194300 
19 4400 
194500 
194501 
194502 
194512 
19 4513 
194514 
194515 
19 4 5 1 6 
19 451? 
19 4518 
194521 
194522 
194523 
19 4524 
19 4525 
194526 
19 4527 
194528 
194529 
19 4530 
19 4531 
19 4532 
194538 
194539 
19 4540 
19 4600 
19 4700 
19'4800 
194900 
195000 
195100 
19 52 00 
195300 
195400 
19 55 00 
19 56 00 
195700 
195300 
195900 
196000 
196100 
196200 
19 6300 
19 64 00 
196500 
19 66 00 
196700 
196800 
196900 
196901 
196902 
196903 
196904 
196905 
19 7000 
197100 
197200 
197309 


g ( J»2r4> 

( J»2.5) 

0CJ,3>2 ) 
0( 

D( J»3»4> 


0(J> 3*5) 
D.Csl>4> 1) 
D(J»4r2) 
DC J>4>3)‘ 
DC Jr4>-4 ) 
DC J»4^5) 
DC J»5»l) 
DC J.5»21 
DC J>5>3) 
DC J>5 p4 ) 
DC J, 5»S ) 


-R1*C7«R3.U 

R1*G AHI 

R2»C1-V*UU 

R1*V-R?*C5 

C4*^R2*GAH2«/ 

-R2»C7«R3*V 


R 2 *GAHI 
R 3 *C l-w»UU 
Rl«h“R 3 *C 5 
R 2 *k-R 3 »C 6 
C4 +R3*GAH^*^ 

R 3*G AHI 

C-C2+2-»Cl)»UU 
R1*C3-C5*UU 
R2*G 3-C6*U'J 
R3*C3“C7»Ud 
R4»G AMVA*U'J 


C*****»END 
C 


GF AHATRX 


12 COMTINUE 

DO 25 J=2» JHAX- 1 
JJ* CJ-1)»720 
IF CJ.GT.2) GO TO 777 
HADD06 = C I A036«-JJ)/52 1 

KADDQoH = C I A0Q6+JJ-72J 1/521 

GO TO 770 
C6H = RX 
Co = R.y 

HADD06P =t IAOQ6+JJ+720) /521 
HAOCSl = ( IA0Sl■^JJ)/52l 
HA0DS2 = t IA0S2 + JJ)/52l 
KADDS3 = : IA0S3+JJJ/521 
HAD0S4 = t IA0S4 + JJ1/521 


SYNCH 

SYNCH 

7 77 

77 6 
SYNCH 

SYNCH 

SYNCH 

SYNCH 

SYNCH 


HAOOS5 


IFC CK.LT.2).0R 
RX = C6 
RY = Q6» 

RJ = SCKLf6»J) 
RJ = 1./06 
RMJ=RH*R J 
RR = RHJ*96“ 

RF = RHJ»a6P 
00 23 N=lf5 


IA0S5+JJ)/521 

‘ ■■ CK.GT.C M ), Q!J , 


CL.LT. ?).OR.CL.gT.LM) ) 


23 


AC 

AC J»N»2) 
AC Jf Nr5) 
AC JrN>4 1 
AC J>N>5) 
BC J^N.l) 
BC Jf Nr2) 
BC J»N»3) 
BC J,N, 41 
BC J»N»5) 
CCJf Nrl ) 
CC J,Nf2) 
CC J>Np3) 
CCJ»N.41 
CC J>N»5) 
AC J^N^Nl 
SC JrN^Nl 
CC JrNfN) 
CONTINUE 
FtJ»ll = 
FCJ»2) = 
FCJ»3) = 
■^CJ»4) = 
FtJ»5) = 
CONTINUE 


-ocj-i»N»n 

-D(J-l>N.21 
-D(J-i»N»3 ) 
: -0CJ-1,N*4) 
i -0CJ-1»N»5) 
0.0 
: 0.0 
: 0.0 
: 0-0 
0.0 

■ OCJ+l.N.l) 

■ DCJ+1.N.21 
DCJ+l,h.31 
DCJ91.N.4) 
DCJ+l»N»5) 
AC J. N» N)~RR 

ce 

CCJ.N.M'RF 

51 

52 

53 

54 

55 


C**»**ENO 

C 


OF F IL T R X 


Figure 3-14. Processor Code for SAM - Subroutine STEP
(META Assembler) (Cont)
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00 10 J = 

1 r JHAX 


JJ= (J-1)*Z20 


HADOOG - 

{ IA006*JJ)/52 1 

SYNCH 

HADOXH = 

C I AOXK* JJ)/5? 1 

SYNCH 

HADOXP = 

t IA0XP+JJ)/5? 1 

SYNCH 

MADDXMN= 

( lAOXHN ♦ JJV 521 

SYNCH 

HADDXPM= 

[ lAOXPN +JJ)f 521 

SYNCH 

KAODYH = 

t I A0YK+ JJ)/5’ 1 

SYNCH 

MADDYP = 

( I A0YP+JJ)/52 1 

SYNCH 



MADDYHN= 

C lAOYHN t JJ)' 521 

•SYNCH 

KADOYPN= 

( lAOYPN +JJK 521 

SYNCH 

HAOOZH = 

( IA0ZH+JJ)/52 1 

SYNCH 

MAODZP = 

I IAOZP + JJJ/52 1 

SYNCH 

HAD0ZHN= 

C I AOZHN +JJ )' 521 

SYNCH 

HAODZPN= 

[ lAOZPN +JJ)^521 

SYNCH 




SYNCH 


IF({«.LT. 2>.0R,(K.GT,V M).QR.(L-LT.2).2RCL.- T-U>'' GC TO 10 
RJ = QtKL>6»J) 

XK = (XP-XH)*DYZ 
Y'< = aP-Y«)*CY2 
ZK = (ZP-ZM)*DY.? 

XL = tXPN-XHN)»D2Z 
YL = (f PN->Y.HN)*DZ2 
ZL = XZPN-ZKN)*D23 
X-X(J,n = I YK*ZL-ZK*YL) *RJ 
XXCJ»2) = (ZK»XL-XK*ZU*RJ 
XX(J*3> = C XK*YL“YK*XL> *RJ 

XXCJ»4) = -OMEGft*C ZCKL* J)*XX(J.2)-YCKL» J)*X'f (Jr’) ) 

CONTINUE 

DO 12 J=lrJMAX 

JJ=J-1 

MADDOl = C I AOQl+JJ)/52 1 
HADQ02 = C IA0Q2tJJ)/521 


HADD03 = ( IA0Q3+JJ)/52 1 


SYNCH 


HADDOA 


( lA034+JJ)/52 1 


C 

C* * * ft 

c 


MAD005 = ( IA0Q5+JJ)/5Z 1 

IF( CK.LE.2) .CR.<'..GT.( H).0'< .<L.LE.2>.3R.(L.GT.LH' 
R1 =XX(J,l)ftHOX 
R2 =XX(J»2)«HDX 
R3 =XX( J»3)*H':X 
RA =XX<JftS )*HOX 

ftftftAH ATRX 

RH= 1-/01 
U = 02»RR 
V =» 03*RR 
W = OA*RR 

UU = U»R1+ V*R2«-H*R3 
UT = U**2+ V**2*H »*2 
Cl = GAHI*UT*-5 
C2 = Q5*RR*GAH,NA 
C3=C2-C1 
CA^RAvUU 
CS=G AHI*U 
C6=GAMI*V 
C7=GAHI* H 
OCJrlftl) = RA 


D( J'ftlftZ) 
D(Jftlr3) 
0( J» IftA) 
D( JftlftS) 
D(Jr2ftl) 


RA 

R1 

R2 

R3 

0. 

R1*C1-U*UU 


Figure 3-14. Processor Code for SAM - Subroutine STEP 
(META Assembler) (Cont)
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S 3 1 

1U5T BE ZERO ON 

19 76 00 


IF( (K.t.E.2).0R.(K.GT.KM).0R,lL,LE.2).0R.(L.GT.LM)5OTO666 

19 77 00 


CALL BTR i: 2» JM) 

19 7800 

666 

DO 21 J = 2»JH 

197900 


SI = FtJ.n 

19 8000 


S2 = FtJr2) 

198100 


S3 = FtJ»3) 

19 8200 


S6 = rtJ»4) 

19 8300 


S5 = Ft J»5 ) 

19 8301 


HADDSl = t IA0Sl*JJ)/52l 

193302 

SYNCH 


19 8303 


MADDS2 = C lA 0S2+ J J)/52l 

198306 

SYNCH 


19 8 305 


NADDS3 = t IA0S5«-JJ)/52l 

193306 

SYNCH 


198307 


HA0DS6 = t 1A0S6+JJ>/521 

19 8308 

SYNCH 


19 83 09 


HA0DS5 = t IA0S5*JJ)/52l 

198311 

21 

CONTINUE 

19 86 OC 

20 

CONTIN Ui 

213600 


RETURN 

21 8700 


END 


Figure 3-14. Processor Code for SAM - Subroutine STEP 
(META Assembler) (Cont)


188640 Since the PROCEDURE XXM1 has been INCLUDED it is
necessary to perform address calculations for the X,
Y, Z arrays. In a similar fashion IA0XM represents
the address of X(KL-1,J), or rather X(0,1) on cycle 1
and X(511,1) on cycle 2. At this juncture it appears
that one is accessing outside of array
bounds. Note that in the original FORTRAN (Figure
3-6) the L and K loops go from 2 to LM and KM
respectively, while the hidden N loop of this Figure
does not indicate this. Line 189433 of Figure 3-12
is an IF branch meant to indicate that the code will
not be executed. In fact a transposition network
calculation will be made for PE#=0 on an address one
less than the base address in order to calculate the
OFFSET. However, because of the K,L calculations
done in the PE code those specific accesses are not
performed; i.e., those PE's whose K or
L value is less than 2 or greater than KM or LM will
not perform the computation.

188641-18850 Similar computations for X(KL+ND,J) etc. with J 
always set equal to 1. 

189405 First inner J loop which has been included from 
procedure XXMl. 

189407-189432 Synchronizations and OFFSET computations by 
MOD (Address, 521) 
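
The address arithmetic in these annotations can be illustrated with a short Python sketch. It assumes, which this excerpt does not spell out, that a linear Extended Memory address is split by the constant 521 into an integer quotient (as in the `(IA0... + JJ)/521` expressions of Figure 3-14) and a remainder MOD(address, 521). Which of the two the report calls OFFSET is not clear from this excerpt. The function names and the base address of 1000 are hypothetical.

```python
N_MODULES = 521  # constant used in the listing's address expressions (a prime)
J_STRIDE = 720   # address increment per J step, per the DO 10 annotation


def split_address(address, n_modules=N_MODULES):
    """Split a linear address into (quotient, remainder) with respect to 521."""
    return address // n_modules, address % n_modules


def j_line_remainders(base_address, jmax):
    """Remainders MOD(address, 521) for J = 1..jmax along one K,L line.

    The address for index J is base_address + (J-1)*720, mirroring
    JJ = (J-1)*720 in the listing.
    """
    return [split_address(base_address + (j - 1) * J_STRIDE)[1]
            for j in range(1, jmax + 1)]


if __name__ == "__main__":
    remainders = j_line_remainders(base_address=1000, jmax=30)
    # True when no two J steps share a remainder
    print(len(set(remainders)) == len(remainders))
```

Because 521 is prime and 720 is not a multiple of it, successive J steps along a line fall on distinct remainders until 521 steps have been taken.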

189500-194400 DO 12 loop with attendant accesses of Q matrix
values 1-5.
194500-197000 DO 25 loop with the fetches to Q(KL,J,6),
Q(KL,J-1,6) and Q(KL,J+1,6). For simplicity,
additional computations were not made in the N loop
initiation to specify an IA0Q6P (plus) and IA0Q6M (minus)
equivalent to the J+1 and J-1; rather, the
addition and subtraction were left to be done in the MOD function
expressions of lines 194527 and 194523. This would in fact
be inefficient as it would be performed for each J.

The IF branch spanning 194503-194527 has been
explained in 3.2.3.2.

197800-198310 DO 21 loop 

198400 End of DO 20 loop - The cycle loop 

In an analogous manner the Processor Element Code is
generated. In this case, however, each processor performs a
calculation to determine its relative address as a function of cycle
and PE#.

188602 Calculation of relative address 

188604 Calculation of L value 

188606 Calculation of K value 

188607-188628 Calculation of array addresses in Extended 
Memory 

189405-189461 DO 10 loop included from XXMl procedure. Note 
as the J index increases the array address increases 
by 720. Also line 189433 indicates the "non- 
computation" for undesirable K and L values 

189500-194400 DO 12 loop 

194500-197000 DO 25 loop 

197700 CALL BTRI, a SUBROUTINE in the normal FORTRAN sense.
Its modifications into SAM Extended FORTRAN are shown
in Figure 3-21 to be few indeed. (A branch around
BTRI should be explicitly shown, similar to line
189433.)

197800-198311 DO 21 loop 

198400 End of N loop for number of cycles 
3.2.3.4 Assembler Code for STEP


To be supplied in Phase II 

3.2.3.5 Subroutine BTRI - SAM Extended FORTRAN

As can be seen in Figure 3-19, the comparison of the original
FORTRAN (Figure 3-17) and the SAM Extended FORTRAN (Figure 3-18),
only one change had to be made in the code. This was the LOCAL
declaration for the named COMMON /BTRID/. Since no extended variables
are fetched or stored in this piece of code, it runs entirely
internal to the processor as written.

3.2.3.6 SUBROUTINE XXM and XXM1

It was noted in examining the IMPLICIT code that the majority of
calls to the SUBROUTINE XXM occurred in loops whose initial and
terminal values precluded taking the branches which occur in
this code (lines 245800, 245900, 247500, and 247600). Since
testing these branches reduces the performance of the whole code on both the
CDC 7600 and on SAM, the code was split into two SUBROUTINES: XXM,
to be used when the calling loop reaches the boundary values,
and XXM1, for those calling loops in which K is never equal to 1 or
KMAX and L is never equal to 1 or LMAX. See Figures 3-20 and 3-21.

Figure 3-22 shows XXM1 written in SAM Extended FORTRAN and Figure 3-23
shows the differences.

Since this code was brought into STEP via the INCLUDE statement, 
further discussion is not necessary. 
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4 2300 
14230" 
142400 
42500 
•42600 
42700 
42900 
42900 
43000 
43100 
43200 
43300 
434 00 
■43500 
43600 
43'^00 
'43900 
•43900 
144000 
144100 
■44200 
44300 
44400 
44500 
44600 
4470" 
44SOO 
4490" 
45000 
45100 
45200 
45300 
45400 
455 00 
45600 
4 57 00 
4590!) 
4590.0 
4600C 
46103 
46200 
46300 
46400 
46500 
46600 
46700 
46900 
46900 
47000 
471'0 0 
47200 
47300 
47400 
47 500 
47600 
47700 
47800 
47900 
48000 
48100 
48200 
48300 
434 00 
48500 
48600 
48700 
4 8800 
48900 
49000 
49100 


12 


14 


11 


SUBROUTINE BTRKILA.im) 

L0CAL/BTRID/A(60»5»5)»3 (6) ,5 »5 ) » C C5 Or 5, 5 )» D f 6C» 5» 5 F C 6 : » 5 3 
OIKENSION H<5r5 ) 

REAL LllrL21»L22»L 3l»L5 2»L 33 »L4 1» L4 2»L4 3»L44 rL5 I»L52 »L53 »L54 »L5 5 

IL=ILA 

IU=IUA 

IS=IL+1 

IE=IU-1 

INSERT LUOEC 

L11=1./B<IL»1»1 ) 

L21=B( IL»2» 1> 

U12=BCIL»1»2) *L11 
L22=l./( B: IL»2»2)-L2l*J 12) 

U13 = BaLrlr3)*Lll 
U14 = BT lL»l»4)<kLll 
U15=BtILrlr5)*Lll 
L31=B<IL»3»1) 

LS2=BtILr3r2>-L51*U12 
U23=fB( ILr2»3)-L21*U13) *L2 2 
L33=l./< BC U»3> 3)-U13*. 31-U2 3*L32> 

U24=C3( IL»2»4)“L2t*Ul4)*L22 
U25=CBCILr2»5 3-L21*U15) »L22 
L41=BtILr4rI) 

L42=8CILr4 r2)-L41*U12 

L4 3 = B(IL»4 r !)-L41*U13-. 4 2*U2 3 

U34= tBt ILr 3»4)-L31*U14- L32*U24)*L33 

L44=l./( B: IL»4»4)-U14<». 41-U24<*L42-U34 *l4 3) 

U35 = {8( ILr3»5)-L31»U15- L32*U25)*L3 5 
L51=BULr5»l) 

L52=B( lLr5»2)-L51*U12 

L53=BaL»5»3)-L51*U13-L52*U?3 

L54 = BClLr5»4)-L51*U14-. 5 2*U2 4-L53»U 3 4 

U4 5=(B( ILr4»5)-L41*U15*L42»U25-L4 3*U35)*L4<4 

L55=l./( S[ IL»5r5)-L51*J 15-L5 2*U 25"L 5 3*U 35-^54*0 5) 

COMPUTE LITTLE R S 
D1=L11*F(IL»1 ) 

02-L22*(FI IL»2)-L21*01) 

0 3=L 33*tFC IL»3)-L31»0t-L32*D2) 

D4 = L44*< Ft IL»4) -L4 1*D1*L42<*02-L43*3 3 ) 

D5=L55*{FC IL»5)-L51*D1- L52'*0 2-L5 3*0 3-L5 4*3- ) 

COMPUTE BIG R S 

FCIL.5) = 05 

F(IL»4)=04-U45*05 

FtIL»3) = 03-U34*FtILr4>- U35*05 

F(IL»2')=D2-U23*F<IL»3)*U24*F{IL»41“U25*05 

Ft IL»1) = D1-U12*F{ IL»2)- U13*F(IL»3)-U14*FtILr4)-0' 5*05 

COMPUTE C PRIME FCR FHST RQ). 

DO 12 M=l,5 

01=Lll*cnLrlrM) 

D2*L22*t C( IL»2r M)-L2l*3 1) 

D3 = L33*(C; IL»3»M)-L31*3 1-L32*D2 ) 

D4=L44*tC( IL#4»K)-L41*3 1“L 4 2*02-L 4 3 *0 3 ) 

05 = L55*t Ct IL»5»H)-L51*3 1 -L 52 * D2 -L53 * D 3 -l54*04 1 

B(IL»5rH)=05 

Bt IL»4» H) = D4-U45*D5 

B(IL»3»M) = D3-U34*B<r. .4»M)-U35*D5 

B(ILr2rM) = 0Z-U23*BtIL»3.M)-U24*B( IL»4f M)-U25*C5 

S(IL*lrK) = Dl-U12»ea. »2rM)-U13*3CIL»3»M)-U14*9(!L»4rH)-U15*C> 

00 13 I = IS»1E 
COMPUTE e PRIME*0IGR 
DO 14 N=lr5 

Ftif N) = F(IrN)-A<IrN. 1)* Ft I»l,n- At I, Nr 2) *FC I -Ir 2)- At I »Nr 3) «F{I-lr 
*)-A(IrNr4)*F<I“lr4)-A(t .Nr5) *F( 1-1.5) 

COMPUTE B PRIME 
DO 11 N=lr5 
DO 11 M=l,5 

HtN. H) = BtI .N.H)-AtI »N.l )*3tl-l. l.rMl-ACi .Nr2) *Bt I-1.2.M )-A( IrN.3 )* 
»B(I-l»3.H)-AtI.Nr4)*BCI"lf4.l<)-AtI. N.5)*BtI-l»5.M) 

INSERT LUOEC AGAIN 
Lll=l./Htl rl) 


Figure 3-17. Original FORTRAN - Subroutine BTRI 

ii9<*09 
49500 
49600 
49r00 
49500 
49900 
50000 
50100 
50200 
50300 
50400 
5 05 00 
5 06 00 
50700 
50800 
'509 00 
51000 
51100 
51200 
5 1300 
51400 
51500 
51600 
51700 
<51900 
51900 
5 2000 
52100 
.52200 
52300 
52400 
52500 
52600 
5270C 
5280C 
52900 
53000 
53100 
153200 
153300 
153400 
'53500 
I5360f' 
■53700 
53800 
53900 
54000 
54100 
54200 
54300 
'54400 
'54500 
•54600 
54700 
■54900 
54900 
55000 
'55100 
•55200 
'55300 
1554 00 
155500 
155600 
■55700 
'5580C 
55900 
'56000 
•56100 
56200 
56300 
564 00 
56500 
566 00 
56700 
56800 
56900 
5-7 000 
57100 
57200 




15 

13 


17 


18 


I *L11 

L22=l./< H( 2»2)-L21*U12l 
U13=H(1 f 3)*L11 
U14=H{1.4)*L11 
Ul5=Htl*5)*Lll 
L31'=H( 3»1) 

L32=KC3>2)-L31*U12 

U2,3=<H(2^3)-L2t*U13)'>L2 2 

L33 = l./CHt 3,3)-U13*L31*U23»L32) 

U24 = <HC2j. 4)-L21»U14 )*L? 2 
U25= CHt2,5 )-LZl »U15 )»L? 2 
L41=HC4» 1) 

L42=H(4f2) -L41»U12 
L43=HC4»3)-L41*U13-L42* U23 
U34=<Ht 3»4 )-L31*U14-L3?»U24)*L33 
L44=1./(HC 4»4)-U14*L41-U24*L4Z-U3 4*L4^) 

U35={H( 3»5)-L31*U15-L3?»U25)»L33 

L51=Ht5»l) 

t52 = H{5»2)-L51'»U12 

L53 = HC5»3)-L51*U13-L52»U23 

L54=H(5»4)-L51*U14-t52'>U24-l.53*U3 4 

U4 5= CHC4,5)-L41*U15-L4>*U25-L4 3*U35)i*L4 4 

L55=1./(H( 5. 5)-L51*U15-L52»U25-L5 3<»U35-L54*U45) 

COMPUTE LITTLE RIS 

oi=Lii*Fa»n 

D2 = LZ2*CF[ 1.2 )-L2 1*01) 

0 3=L 33*<Ft I .3)-L31*01“. 3 2*02) 

D4=L44*CFt I»4)-L41*Di-k42*D2-L43*a3 ) 

05=L55*(FI I.5)-L51*Dl-.52*02-L53*D3-L54*D4 ) 

COMPUTE BIG RIS 

Fa.5) = 05 

Fa»4)=04-U45*05 

FCI.3) = D3-U34*FCI.4)-U3 5*D5 

FCI .2) = 0 2“U2 3*F( 1 . 3 ) “U2 4 »FC T .4 ) -U ’5 *05 

F(I.1) = D1-U12*F< I.2)-Ul 3*FCI.3)-U14*FU .4)-'J15*05 

COMPUTE C PRIMES 

00 15 H=1.5 
D1=L11*CCI.1.M) 

D2 = L22«tCt 1.2»M)“L21*3t ) 

03 = L33*CCC I.3.M)-L31*01-L3 2*G?) 

D4=L44*t CC I.4,M)-L41*01-L4 2*C2-L4 3*D3) 

D5=L55*(CC I,.5.M)''L51*01 -L 5 2* C2- L5 ?• 0 3-L 54* 0 < ) 

B< I»5.M)=D5 

atI.4.M)=D4-U45*D-i 

8(1. 3.M) = 0 3-U34*8(I.» »M)-UI5»95 

B(I.2.M) = 02-'J23»B< 1.5 . M) “U24»9( I. 4 .M) -L?5»D5 

B(I.l.H) = D1-U12*B(I.I . M)-U13*8( I. 3.M)-L14. SCI . M )- 'J1 5*05 

CONTINUE 

1 = 10 

COMPUTE B PRIKE*8IG R FOR L»ST ROW 
DO 17 N=1.5 

F(I. N)=r(I .N)-4(I.N.1)*FI,I-1.1)-ACI»N.2)*F( I-1.2)-ACI.N.3)‘ 

* F(I-1.3)-A( I.N.4)*F(I- 1»4)-A(I.N.5 )*F( 1-1.5) 

COMPUTE B PRIME 
DO 18 N=1.5 
00 18 H=1.5 

H(N.M) = 3tI .N.M)-AC1 .N.l ) *B ( I -1 . 1. M) - A(I . N. 2 ) »3C I“1 . 2 . ■' )- AC I . 
*B(I-1.3.M>-ACI.N»4)»8a-1.4.K)-A{l»N»5)*BCI-1.5.M) 

INSERT LUDEC AGAIN 
Lll=l./HU.l) 

L21=H(2. 1) 

U12=H( 1»2)*L11 
L22=l./(H(-2.2)-L21*U12) 

U13=H(1,3)*L11 
U14=H(1.4) *L11 
U15=H(1.5)*L11 
L31 = H(3. 1) 

L32=H(3,2)-L31*U12

U23=(H(2,3)-L21*U13)*L22

L33=1./(H(3,3)-U13*L31-U23*L32)

U24=(H(2,4)-L21*U14)*L22

U25=(H(2,5)-L21*U15)*L22
L41=H(4. 1) 

L42=H(4.2)-L41*U12 
L4 3=H(4. 3)-L41*U13-L42» U2 3 
014=(H(3»4 )-L31*U14-L32*U24)»L35 
L44=1.7(H{ 4.4)-U14*L41-U24*l 42-U34*L4i) 
U35=tH(3.5)-L31*U15-L3?*U25)*LI3 
L51=H(5. 1) 

L52=H(5.2)-L51*U12 


Figure 3-17. Original FORTRAN - Subroutine BTRI (Cont)
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U4 5= <Ht 4 »5)-L41*U15-L42 *U25-L4 3»U35 )»L44 
L55=l./(Ht 5»5)-L51*UI5-L52*U?5-L5 3»U35-L54»U4 5) 

COMPUTE LITTLE RIS 
01=L11*FCI»1) 

D2=LZ2*< F: Ir2)-L2l*0I ) 

03 = L33»<Ft 1»3)-L31*D1-. 32»D’) 

04=L44*CF( I»4)-L41*Dl-_ 4 2*D2-L4 3*D3 ) 

D5 = L55*C F[ I»5)-L51*Dl-.52*02-L5 3*D3-L54*04 > 

COMPUTE BIG RIS 

Ftl»5)-D5 

F< I»4> = 04-U45*05 

F{I,3) = 0 3-U34*FU»4)-U5 5*05 

F(I»2)=D2-U2 3*F<I»3)-U2 4*FU»4)-U:5*05 

FCI,1)=D1-U12*F(I»2)"UI 3*rCI»3)-U14 *F(I f4)-U15*05 

I = IU 

1 = 1-1 

DO 19 N=l»5 

FCI»N)=F(r»N)-F(I*l»l)*8tI»H.l)-Fn*l. 2)*B(r »V»?)-F(I+: . 
* )-F<I+l»4)*B(I»N»4)-FC I<-1*5)*B<I»N.5) 

IF ( I.GT.IL) GOTOZO 
return 

END 


Figure 3-17, Original FORTRAN - Subroutine BTRI (Cont) 
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SUBROUTINE BTRI(ILA,IU1 ) 

CQHH0N/BTRID/AC60.5»5>. 3 ( S 0 )» CC ' 0, 5 . 5 J »D ( 6Cp5, 5 )• F { 60 » 5 > 

DIHENSIDN H<5»S ) 

REAL m»L 21.L22»LJ1»L5 2 »L 3 3 »L4 1 ,L<i 2 »L4 3»LA4 # L5 1» L5 2 »L5 ! f L5A »L55 

IL=ILA 

IU=IUA 

IS=IL*1 

IE=IU“1 

IMSERT lUDEC 

Ln=l./B<IL»'l»l ) 

L21=8tILF2Fl) 

U12= 0{ ILfI »2) *L11 
L22=1./(B( ILf2»2)-L2I*J12) 

UI3=B(ILf1 f3) »L11 

U14 = B( IL»1.4)»LU 

U15=BC ILf 1>5) »Lli 

L31=B(IL»3»n 

L32=BCIL»3»2)-L31*U12 

023=<Bl IL»2»3)-L21»U13) *L22 

L33=l-/C BC IL»3 f 3)-U13*’. 3I-U2 3*L32 ) 

U?4=(B( IL»2»4)“L2l»U14) »L22 
U25= (B(ILj.2»5)-L21»U15)*L2 2 
L41=0( IL#4f1) 

L42=BCIL»4».2)’'L41*U12 
L43=9CIL»4»3)-L41*U13-.42*U?2 
U34= t9( IL»3 f 4)-L31»U1'4*L32*U24)*L3 5 
L44=l./T 6t IL»4»4)-U14»_41-U2 4‘L42-U34*L43) 

U35-(B( ILf3f5)-L31*U1S-L32*U25)*L33 
L51=B(ILf5»1) 

L52=8<ILf5»2)-L51*U12 

L53=B(ILf5f3)-L51*U13-.52»U’3 

L54 = B(IL»5 f4)-L51*U14-_5 2*U24-L5 3«U34 

U45=(B( IL»4f5)“L41*U15*L42*U25-L43*US5) »L44 

L55=l./< et IL»5»5 J-L51*J 1 5~ L5 2»U25-L5 3*U 35-L5 4 ‘ J . 5 ) 

COMPUTE little R S 
D1=L11*F(ILf1) 

D2=L22»tFC IL,2)“L21*01) 

D3=L'33*(FI IL»3)-L31*01*L32*D2) 

04 = L4 4*(Ft IL, 4) -L41*D1'L4 2-*D 2-143*0 ' ) 

DB = L55»(FI IL»5)-L51*I)1- L52 *02-LS 3*0 3-L5 4*0O 

COMPOTE BIG R S 

FUL»5) = D5 

FCIL»4)=04-U45*05 

F(IL»3) = 03-U34*FUL»4)- U35*DS 

FtIL»2)=D2-U23*FC IL » 3 J* U 24 *FCIL .4 )-U 25* 05 

FaLFn = Dl-UlZ»F<IL»2)' U13*F (IL » 3 )-U 14»F (IL, 4 )-J' 5*05 

COMPUTE C PRIME FOR FITS! ROti 

DO 12 M=l»5 

01=L11*C<IL#1»M3 

02=L22*<a 1L»2»K)-L21*0 n 

D3=L33*tCI IL»3» K)-L31*I 1-L32»D3J 

04=L44*tC{ 1L»4 f H)-L41*J 1~L 4 Z*D2-L 43 *03) 

D5=L55*CCt ILF5rM)-L5l*0 I -L 52*D2-L53 *03- L54*0 4 ) 

B(IL»5»H)=05 
Bf IL »4 .H )=D4-U45*D5 
BCIL»3»M) = 03-U34»B(IL»4»H)-U35*D5 
6CIL»2.M) = D2-U23*9tI.»3FH)-U24*BtILF4Fp)-U25*05 
BULfIfH) = D1-U12*B(IL».2fH)-U13*E(IL»3fM)-'J14*3(ILf4,>0-U15*C5 ’ 
DO 13 I=ISfIE 
COMPUTE B PRIME*BIGR 
00 14 N=1»5 

Fa»N)=F(I»N)-A(I»N»l)‘F(I-l»l)-A(lF Nf 2)*F(I-1f2)-ACI»'Jf3)*FC:- If 

*)-ACIfNf4) *FCI-1f4)-A(1 »Nf 5 ) *F( I-1»5 ) 

COMPUTE B ORIHE 
DO 11 N=1f5 
00 11 «-1f5 

HCNFM> = BaFN,M)-AClFNFl )*BCI-1»1 »P)-ACIfNf2)*S( :-If2f«)-A( IfS'fT)* 

*3(I-l»3FH)-A<I»N»4)*aCE-lF4 FK)-A{ IfNf5)*B(I-1»5fM) 

INSERT LUDEC AGAIN 
L11=1./H<1*1) 


Figure 3-18. SAM Extended FORTRAN Subroutine BTRI 
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•L21 = H(2. 1) 

U12=HC1»25*L11 
L 22=1. 7 ( HI 2.2)-L21*U12> 

U13=HC1. 3)*L11 

U14=H(1»4)*L11 

U15 = HUp 5)*Lll 

L31 = H( 3. 11 

L32=HC5»2)-L31»U12 

U23=CH<2»3)-L-21-*U13)*l» 2 

L33=1,/CH; 3»3)-U13»L 31'U23»L32) 

U24={H(2»4)-L21*U14>*L> 2 
U25=lHf 2.5)-L21*U15 )*L>2 
L41=H(4.1) 

L42=H(4» 2)-L41*Ul2 

L4 3=H(4.3)-;L41*U13-L4 2»U2 3 

U34=<HC3.4)-t31*U14-L32*U24J*L33 

L4 4 = 1.7 (HI 4»4)-U14*L41-U24.»L-42“U34»L4 1) 

U35=(H(3,5J-L31«U15-L1? *U25)*L33 

L51 = H(5» 1) 

L52 = H(5-.2)“L51»U12 

L5 3=HC5. 3)-L51*U13-L5 2» U23 

1.54=W5»4)-t51*0l4“L52»U24-L53»U34 

U4 5=tH14»5 5-L41*U15-L4? »U25-L43*U35 J*L44 

L55=1.7CHt 5»5)-L51*U15'L52*U25-L53*U35“L54*iJ4 5) 

COMPUTE LITTLE RIS 
01=Lll*F(I.l) 

D2 = L22*CFCI».2)-L21*D1 ) 

•03 = L33*tFH»3)-L31*Dl--32*D2) 

04=L44*{ F: I»4)-L41»D1-. 4 2*D2-L4 3*03 ) 

D5=L55*< Ft I»5)-L51*D1-.52» D2 -L5 3* 03- L54 » D4 1 

COMPUTE, BIG RIS 

F(I,5)=05 

F< 1,4) = 04-U45»D5 

Fa.,3) = 9 3-U34»FCI,4)-U5 5*0 5 

FII,2) = 0 2-U2 3*F(I.3)“'J’ 4*F( I .4)-U25*D5 

Ft I .11 = 0 l-U12»F CI,2 )-Ul 3»F( I f 3 1-U14 -Ftl »41-'!l 5«55 

COMPUTE C PRIMrS 

DO 15 M = 1.5 

01=L11*C(I»1fM) 

02=L22*(C: I>2»Ml-L21*Cin 
03=L33*{ Ct I»3.M)-L31*0l-L32*C2) 

D4=L44*(Ct I.4.M)-L41»9l - L4 2* C2- L4 3* D 3) 

D5=L55*t C( I »5»M )-L51»01 -L 5 2* C2- L5 3* D 3-L-54 *0'- ) 

B(I»5»H)=05 

BC I .4.H) =D4-U45»0 

BCI.3.M) = 0 3-U34»3a»4..M)-U 25*D5 

Bt.IrE.M) = D2-J23*BC‘I>5 .M)-U24i3t I.4.H)-U25*D5 

8( I.l.H) = 01-U12-*BCI»’ .HJ-un*9( I.-3»M1-U4»B(I-.4,M)-U1 5*35 

CONTINUE 

■I=IU 

COMPUTE B PRIHE*BIG R F.OR L«ST ROk 
DO 17 N=l»5 

F(I.N)=F(I.»N)-AtI ,N»1)*F(I -l.n-A( I »N . 2 > *F( I -1 . 2) “ At I . N ► 3) • 

* Ftl-1»3)-A( I»N.4-)*F(I- 1-.4J-ACI,N»5 1*FC 1-1.5 J 

COMPUTE B PRIME 
DO 18 N=1.5 
DO 13 H=1.5 

HtN.M)=B<I»N»H)-A(I.N.l)*3Cl-l.l.H)-A(I. N.2 )»SCI-1.?.M)-ACI, 
*B( 1-1.3. H)-A(1.N.4-)*B« - 1 . 4 .> )- At I. N.5 ) »B 1 1 - 1 . 5. « 1 
INSERT LUDEC AGAIN 
Lll=1.7Htl.l) 

L21=Ht2. 1) 

U12=Htl.2)*Lll 
L22=l. 7t HC 2.2 )-L21*U12) 

U13=Htl. 3)»L11 
Ul4=Htl.4)*Lll 
U15=Htl.5)*Lll 
L31=Ht3. 1) 

L32=Ht3.2)-L 31‘012 
U23=<Ht2.3)-L21*U13)*L22 
L33=l./f HC 3-»3)-U13*L31- U23*L2?) 

U24=tHt2.4)-L21»Ul4)*L22 
U25=fHt2.5)-L21*U15)*L12 
L4 l=Ht4. 1> 

L42=Ht4.2)-L41*UlZ 

L43=Ht4. 3)-L41*U13-L42» U23 

U34 = tKt 3.4)-L31*U14-L3>*U24)*L33 

L44 = l./t Ht 4.4)-U14*L41*U24»L42-U34*L4 3) 

U35=tH( 3.5)-L31*U15-L32 »U25)«L3 3 
L51=Ht5.1) 

L52=Ht5.2)-L51*U12 


Figure 3-18. SAM Extended FORTRAN Subroutine BTRI (Cont) 
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L53=H(5»3)“L51*tl5-L52‘UZ3 

C5« = Ht5»ii) -L51»Uli.-t5 2'U24-L5 3»l)3<> 

U45=(H(4 ,5 1-L41 *U15“L4> *U25-L4 5«U35 )*L4‘. 

L55 = i-/C H: 5»5)-L5 I* U 15- L5 2*U 25- L5 3» U 35”L 54 *IJ4 5 ) 

compute: little ris 

01.-Lll»FCI,l) 

D2=L22»(FI I»2)-L21»D1 ) 

0 3=L33*CFt I f3)-L31»D1-, 32*02 ) 

D4=L44»( Ft I f4)-L41*D1-. 4 2*02-L4 3*03 ) 

05=L55»( Ft I»5)-L51»D1-. 5 2*D2-L5 3*03-L54 *04 ) 

COMPUTE BIG RIS 

F<If5)=05 

F(1.4)=04-U45»35 

Ft If3) = D3-U34*FCIf4)-U3 5*05 

FtI*2)=02-U23*F<lF3)-U’4*F(lF4)-U25*D5 

Ftl* D=D 1-UI2*FU»2)-Ut 3*F ( I . 3 ) -U 14 *F( I f 4) “U 1 5*')5 

1 = IU 
1 = 1-1 

DO 19 N= 1f5 

FtlFN)=F(lFN)-F tI + lFl)»B(lFN.l)-FtI+lF 2)*BCI FNF2)-FCI+lF3)*B{lF*iF 
* )-FtI*lF4)*B(lFNF4)-F: I *1f5 )»B(I fNf5) 

IF f I.GT.IL) GOT02-1 

RETURN 

END 


Figure 3-18. SAM Extended FORTRAN Subroutine BTRI (Cont)


1 R 42300 LOCAL/BTRID/A(60,5,5),B(60,5,5),C(60,5,5),D(60,5,5),F(60,5)

Figure 3-19. Comparison of Original FORTRAN and SAM Extended
FORTRAN - Subroutine BTRI
      SUBROUTINE XXM(M,LA,J1A,J2A)
      COMMON/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,DT,GAMMA,GAMI,SMU,FSMACH
     1 ,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
     2 ,RM,CNBR,PI,ITR,INVISC,LAMIN,NP,INT1,INT2,INT3
      COMMON/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
      COMMON/READ/IREAD,IWRIT,NGRI
      COMMON/VIS/RE,PR,RMUE,RK
      COMMON/VARS/Q(720,6,30)
      COMMON/VAR0/S(720,5,30)
      COMMON/VAR1/X(720,30),Y(720,30),Z(720,30)
      COMMON /VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
      LEVEL 2,Q,S,X,Y,Z
      COMMON/COUNT/NC,NC1
      COMMON /FLSH/DX2,DY2,DZ2
C
C     XI METRICS FORMED FOR A K, L LINE IN J
C     SYMMETRY
C
      K = M
      L = LA
      J1 = J1A
      J2 = J2A
      KL = (L-1)*ND+K
      DO 10 J = J1,J2
      RJ = Q(KL,6,J)
      IF(K.EQ.1) GO TO 50
      IF(K.EQ.KMAX) GO TO 51
      XK = (X(KL+1,J)-X(KL-1,J))*DY2
      YK = (Y(KL+1,J)-Y(KL-1,J))*DY2
      ZK = (Z(KL+1,J)-Z(KL-1,J))*DY2
      GO TO 72
   50 CONTINUE
      XK = (-3.*X(KL,J)+4.*X(KL+1,J)-X(KL+2,J))*DY2
      YK = (-3.*Y(KL,J)+4.*Y(KL+1,J)-Y(KL+2,J))*DY2
      ZK = (-3.*Z(KL,J)+4.*Z(KL+1,J)-Z(KL+2,J))*DY2
      GO TO 72
   51 CONTINUE
      XK = (3.*X(KL,J)-4.*X(KL-1,J)+X(KL-2,J))*DY2
      YK = (3.*Y(KL,J)-4.*Y(KL-1,J)+Y(KL-2,J))*DY2
      ZK = (3.*Z(KL,J)-4.*Z(KL-1,J)+Z(KL-2,J))*DY2
   72 CONTINUE
      IF(L.EQ.1) GO TO 52
      IF(L.EQ.LMAX) GO TO 53
      XL = (X(KL+ND,J)-X(KL-ND,J))*DZ2
      YL = (Y(KL+ND,J)-Y(KL-ND,J))*DZ2
      ZL = (Z(KL+ND,J)-Z(KL-ND,J))*DZ2
      GO TO 60
   52 CONTINUE
      XL = (-3.*X(KL,J)+4.*X(KL+ND,J)-X(KL+2*ND,J))*DZ2
      YL = (-3.*Y(KL,J)+4.*Y(KL+ND,J)-Y(KL+2*ND,J))*DZ2
      ZL = (-3.*Z(KL,J)+4.*Z(KL+ND,J)-Z(KL+2*ND,J))*DZ2
      GO TO 60
   53 CONTINUE
      XL = (3.*X(KL,J)-4.*X(KL-ND,J)+X(KL-2*ND,J))*DZ2
      YL = (3.*Y(KL,J)-4.*Y(KL-ND,J)+Y(KL-2*ND,J))*DZ2
      ZL = (3.*Z(KL,J)-4.*Z(KL-ND,J)+Z(KL-2*ND,J))*DZ2
   60 CONTINUE
      XX(J,1) = (YK*ZL-ZK*YL)*RJ
      XX(J,2) = (ZK*XL-XK*ZL)*RJ
      XX(J,3) = (XK*YL-YK*XL)*RJ
      XX(J,4) = -OMEGA*(Z(KL,J)*XX(J,2)-Y(KL,J)*XX(J,3))
   10 CONTINUE
      RETURN
      END


Figure 3-20, Original FORTRAN - Subroutine XXM 
      SUBROUTINE XXM1(M,LA,J1A,J2A)
      COMMON/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,DT,GAMMA,GAMI,SMU,FSMACH
     1 ,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
     2 ,RM,CNBR,PI,ITR,INVISC,LAMIN,NP,INT1,INT2,INT3
      COMMON/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
      COMMON/READ/IREAD,IWRIT,NGRI
      COMMON/VIS/RE,PR,RMUE,RK
      COMMON/VARS/Q(720,6,30)
      COMMON/VAR0/S(720,5,30)
      COMMON/VAR1/X(720,30),Y(720,30),Z(720,30)
      COMMON /VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
      LEVEL 2,Q,S,X,Y,Z
      COMMON/COUNT/NC,NC1
      COMMON /FLSH/DX2,DY2,DZ2
C
C     XI METRICS FORMED FOR A K, L LINE IN J
C     SYMMETRY
C
      K = M
      L = LA
      J1 = J1A
      J2 = J2A
      KL = (L-1)*ND+K
      DO 10 J = J1,J2
      RJ = Q(KL,6,J)
      XK = (X(KL+1,J)-X(KL-1,J))*DY2
      YK = (Y(KL+1,J)-Y(KL-1,J))*DY2
      ZK = (Z(KL+1,J)-Z(KL-1,J))*DY2
      XL = (X(KL+ND,J)-X(KL-ND,J))*DZ2
      YL = (Y(KL+ND,J)-Y(KL-ND,J))*DZ2
      ZL = (Z(KL+ND,J)-Z(KL-ND,J))*DZ2
      XX(J,1) = (YK*ZL-ZK*YL)*RJ
      XX(J,2) = (ZK*XL-XK*ZL)*RJ
      XX(J,3) = (XK*YL-YK*XL)*RJ
      XX(J,4) = -OMEGA*(Z(KL,J)*XX(J,2)-Y(KL,J)*XX(J,3))
   10 CONTINUE
      RETURN
      END


Figure 3-21. Modified Version of Subroutine XXMl for Improved Performance 

on Serial or Parallel Machine 
      PROCEDURE XXM1(M,LA,J1A,J2A)
      GLOBAL/BASE/NMAX,JMAX,KMAX,LMAX,JM,KM,LM,GAMMA,GAMI,SMU,FSMACH
     1 ,DX1,DY1,DZ1,ND,ND2,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ
     2 ,RM,CNBR,PI,INVISC,LAMIN,NP
      GLOBAL/GEO/NB1,NB2,RFRONT,RMAX,XR,XMAX,DRAD,DXC
      GLOBAL/READ/IREAD,IWRIT,NGRI
      GLOBAL/VIS/RE,PR,RMUE,RK
      EXTENDED/VARS/Q(720,30,6)
      EXTENDED/VAR0/S(720,30,5)
      EXTENDED /VAR1/X(720,30),Y(720,30),Z(720,30)
      LOCAL/VAR3/P(120,30),XX(60,4),YY(60,4),ZZ(60,4)
      LEVEL 2,Q,S,X,Y,Z
      CONTROL /COUNT/NC,NC1,DT
      GLOBAL/FLSH/DX2,DY2,DZ2
C
C     XI METRICS FORMED FOR A K, L LINE IN J
C     SYMMETRY
C
      K = M
      L = LA
      J1 = J1A
      J2 = J2A
      KL = (L-1)*ND+K
      DO 10 J = J1,J2
      RJ = Q(KL,6,J)
      XK = (X(KL+1,J)-X(KL-1,J))*DY2
      YK = (Y(KL+1,J)-Y(KL-1,J))*DY2
      ZK = (Z(KL+1,J)-Z(KL-1,J))*DY2
      XL = (X(KL+ND,J)-X(KL-ND,J))*DZ2
      YL = (Y(KL+ND,J)-Y(KL-ND,J))*DZ2
      ZL = (Z(KL+ND,J)-Z(KL-ND,J))*DZ2
      XX(J,1) = (YK*ZL-ZK*YL)*RJ
      XX(J,2) = (ZK*XL-XK*ZL)*RJ
      XX(J,3) = (XK*YL-YK*XL)*RJ
      XX(J,4) = -OMEGA*(Z(KL,J)*XX(J,2)-Y(KL,J)*XX(J,3))
   10 CONTINUE
      RETURN
      END


Figure 3-22. SAM Extended FORTRAN for Subroutine XXMl 
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Figure 3-23. Comparison of Modified FORTRAN and SAM Extended 
FORTRAN for Subroutine XXMl 
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3.2.4 Subroutine STEP (Loop DO 30 & DO 40) 


The arrays Q and S, which have been declared to exist in Extended
Memory, have the following extents:

Q(720,30,6)

S(720,30,5)

The first extent of 720 is in effect partitioned into two parts
at run time by the variable ND. The first index then has
an extent ND and the second index has an extent equal to LMAX.

This means that if ND*LMAX < 720 certain memory locations are not
utilized. This causes some degradation in performance for the SAM
in all three access modes.

Each of the three types of accesses of the Q & S arrays which are
required by the DO 20, DO 30 and DO 40 loops in SUBROUTINE STEP
will be discussed. Because of a complex first order linear
recurrence, the index J in the DO 20 loop must be done serially
while the K&L indices are parallel (see example below). Similarly,
for the DO 30 loop K is the serial index while J&L are the
parallel ones. For DO 40 L is the serial index and J&K the
parallel ones.

An example of the structure of the program is given below. 

DO 20 L=2,LM
DO 20 K=2,KM
DO 18 J=1,JMAX
KL = (L-1)*ND+K
RR = 1.0/Q(KL,J,6)

(plus many other statements including a complex first order
recurrence in J)


18 CONTINUE
20 CONTINUE
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This is a Case I access as described in Appendix A. The ISKIP=ND. 
For ease in handling this generality of splitting the first extent,
it is assumed that 720/ND is integral with value LD. The number
of cycles necessary to access the L's and J's is equal to

No. of Cycles = (LD*30+512-1)/512

For the specific case given in the benchmark, where ND is equal to
15, LD is equal to 48 and the number of cycles is equal to 3.

On cycle 1 one is accessing all L's from 1 to LD for J's from 1 to
10, and for the 11th J one is accessing L's from 1 to 32. This is
done for each K from 1 to ND. Figure 3-26 maps this accessing of
indices from Extended Memory into the processors.
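
The cycle count and the cycle-1 index assignment described above can be checked with a small Python sketch. It assumes that the (J, L) pairs are dealt out to the 512 processors with L varying fastest, which is how the text's example reads. The function names are hypothetical and are not part of the SAM software.

```python
N_PROCESSORS = 512
N_J = 30  # second extent of the Q and S arrays


def cycles_needed(ld, n_j=N_J, n_proc=N_PROCESSORS):
    """Integer form of (LD*30 + 512 - 1)/512 from the text."""
    return (ld * n_j + n_proc - 1) // n_proc


def indices_for_cycle(cycle, ld, n_j=N_J, n_proc=N_PROCESSORS):
    """Return the (J, L) pairs handled in a given cycle (cycle is 1-based).

    Assumes L varies fastest: element n (0-based) maps to
    J = n // LD + 1 and L = n % LD + 1.
    """
    start = (cycle - 1) * n_proc
    stop = min(cycle * n_proc, ld * n_j)
    return [(n // ld + 1, n % ld + 1) for n in range(start, stop)]


if __name__ == "__main__":
    nd = 15
    ld = 720 // nd  # 48 for the benchmark case
    print(cycles_needed(ld))
    first_cycle = indices_for_cycle(1, ld)
    print(first_cycle[0], first_cycle[-1])
```

For ND=15 the last pair of cycle 1 comes out as J=11, L=32, in agreement with the description above.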

The last loop, the DO 40 loop, has the L index as the serial index
for the recurrence relation and the K&J indices as the parallel
ones. The structure is

DO 40 J=2,JM
DO 40 K=2,KM
DO 38 L=1,LMAX
KL = (L-1)*ND+K
RR = 1.0/Q(KL,J,6)

(plus many other statements including a first order linear
recurrence in L)

38 CONTINUE
40 CONTINUE

This can be considered to be a Case II or Case V accessing pattern
as discussed in Appendix A. Since the accessing of Q & S is
identical, a "semi smart" compiler can choose which of the two cases
it wishes to consider this to be. That is, Q(KL,J,6) can really be
represented as Q(K,L,J,6) with K varying from 1 to ND, L from 1 to
LMAX and J from 1 to 30. Since both J&K are totally
parallel and all accesses to Q&S are in the same sense of K,L,J, the
"semi smart" compiler can pick which way to do it. In this case,
because ND is unknown until run time, it would pick Case II.
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The memory layout is shown in Figure 3-24. The accessing pattern 
is described in Appendix A as being of Type 3. This means that
the SAM will access 512 elements of the Q array at one time for
J=1, then 512 for J=2, etc., until J=30. This would mean all K's
would be accessed from 1 to ND up to an L value L(last) such that
512 values are accessed.

For example, if ND=10 then 52 L values would be accessed, each for K
values 1 to 10 except for L=52, which would only access K=1 & K=2.

On the next complete cycle the remaining K and L values would be
accessed up to a maximum of 720. Figure 3-25 shows this.

As can be seen, this could be inefficient if ND*LMAX < 512 and
these parameters were set at run time. A more efficient procedure
could be worked out which would have the same flexibility, either
by recompiling with compile time parameters or else with more
efficient coding to permit compaction of the Q array (see Appendix
C).
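
The ND=10 example can be reproduced with a short Python sketch of the combined index KL = (L-1)*ND+K. It assumes that each cycle simply takes the next 512 consecutive KL values for a fixed J, up to the first extent of 720. The helper names are hypothetical.

```python
N_PROCESSORS = 512
FIRST_EXTENT = 720  # first extent of the Q and S arrays


def kl_to_k_l(kl, nd):
    """Invert KL = (L-1)*ND + K, with K, L and KL all 1-based."""
    return (kl - 1) % nd + 1, (kl - 1) // nd + 1


def cycle_contents(cycle, nd, lmax, n_proc=N_PROCESSORS):
    """(K, L) pairs fetched in a given cycle (1-based) for one value of J."""
    total = min(nd * lmax, FIRST_EXTENT)
    first = (cycle - 1) * n_proc + 1
    last = min(cycle * n_proc, total)
    return [kl_to_k_l(kl, nd) for kl in range(first, last + 1)]


if __name__ == "__main__":
    pairs = cycle_contents(1, nd=10, lmax=72)
    # The K values reached on the last L line of the first cycle
    print([k for (k, l) in pairs if l == 52])
```

With ND=10 the first cycle covers L=1 through 51 completely and only K=1 and K=2 for L=52; the second cycle picks up the remaining 208 values.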

The next loop, DO 30, has the K index as the serial index for the
recurrence relation. Its structure is

DO 30 J=2,JM
DO 30 L=2,LM
DO 28 K=1,KMAX
KL=(L-1)*ND+K
RR = 1.0/Q(KL,J,6)

(plus many other statements including a first order
recurrence relation on K)

28 CONTINUE
30 CONTINUE
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Figure 3-24. Memory Layout for Q Array 
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Figure 3-26. Processor Index Values as a Function of Cycle 
DO 30 Loop Subroutine STEP (ND=15)


























Figure 3-27 shows how the indices will appear in the various
processors. This case requires subiterations of the cycles as on
page A-10. The number of cycles is equal to (ND*30+512-1)/512,
which for an ND of 10 means only one cycle. ISKIP=720.

3.2.5 Functions and Macros

Functions on the FMP will include not only the mathematical
intrinsics, such as ARCTAN, LN, EXP, and SQRT, which are expected
of any compiler, but also a family of functions that are brought
about because of the parallel nature of the FMP.

Math Intrinsics 


Math intrinsics (ARCTAN, LN, EXP, SQRT) are well understood. Some
will be in-line code, some are subroutine calls. All execute
locally to the processor. Since there is nothing new or different
for the FMP, we need not digress to discuss them at this point.

Global Intrinsics 


A form of intrinsic function seen in a parallel language, for
which there is no analog in a serial machine, is the function
which operates across the declared parallelism. A global sum is
the sum of all the elements specified by all the instances of the
index set of the DOALL. A global maximum is the largest element
across the entire DOALL.

To reduce compiler complexity, and to eliminate user programmers’ 
doubts as to whether parallel operation has been achieved as a 
result of compiler analysis, global intrinsics will be supplied. 



Figure 3-27. Processor Index Values as a Function of Cycle
DO 40 Loop Subroutine STEP (ND=10)
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To replace the following serial FORTRAN 


A = 0.0

DO 1 J = 1,1000
A = A + B(J)

1 CONTINUE

the language will allow:

DOALL, J=1,1000
A = GLOBALSUM(B(J))

ENDDO


The global operations will presumably include all of the following.
Assume that we are inside a DOALL loop expressed as
DOALL, J=JSTART,JEND.

Function               Definition

GLOBALSUM(A(J))        Sum of A(J) for J = JSTART to JEND

GLOBALPRODUCT(A(J))    Product of A(J) for J = JSTART to JEND

GLOBALMAX(A(J))        Largest of A(JSTART), A(JSTART+1), ..., A(JEND)

GLOBALMIN(A(J))        Smallest of all A(J), JSTART <= J <= JEND

Global functions are logarithmic in efficiency; that is, it takes
nine steps to produce the 512-way sum across the 512 processors in
one cycle. When the result (such as "A") is a LOCAL variable, it
is produced across the entire extent of the DOALL.
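
The nine-step figure for a 512-way global sum follows from pairwise combination, which the Python sketch below illustrates. It is only a serial model of a tree reduction under the assumption that all additions within one step could run at once on separate processors. It is not the FMP implementation, and the function name is hypothetical.

```python
def global_sum(values):
    """Pairwise (tree) sum; returns (total, number_of_combining_steps).

    In each step, element i absorbs element i + stride. All additions
    within a step are independent of one another, so a parallel machine
    could perform them at the same time.
    """
    vals = list(values)
    if not vals:
        return 0, 0
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps


if __name__ == "__main__":
    total, steps = global_sum(range(512))
    print(total, steps)
```

The same structure serves for the product, maximum and minimum by replacing the addition with the corresponding operation.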
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An extension of the global operation is the formation of a 
parallel linear recurrence in nine (= log2 512) steps, as demonstrated
by Shyh-Ching Chen in his doctoral thesis at the U. of
Ill. In FORTRAN, consider

DO 1 J=1,1000

A(J+1) = B(J)*A(J) + C(J)

1 CONTINUE

This takes 1000 steps, each with one multiply and one add. A
parallel algorithm exists that produces the same result in 10
steps. The parallel algorithm can easily be implemented on the
FMP.
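
One way to see how a first order linear recurrence can be solved in a logarithmic number of steps is recursive doubling, sketched below in Python. This is a generic textbook formulation offered for illustration; it is not claimed to be the algorithm of Chen's thesis or the planned FMP RECURRENCE function. Each step of the recurrence is treated as a map x -> b*x + c, and neighbouring maps are composed over doubling distances.

```python
def serial_recurrence(a1, b, c):
    """Reference: A(J+1) = B(J)*A(J) + C(J), evaluated one step at a time."""
    a = [a1]
    for bj, cj in zip(b, c):
        a.append(bj * a[-1] + cj)
    return a


def parallel_recurrence(a1, b, c):
    """Recursive doubling; returns (values, number_of_doubling_steps).

    After the loop, (mult[j], add[j]) describes the composition of the
    first j+1 steps, so A(j+2) = mult[j]*A(1) + add[j] in 1-based terms.
    The inner loop over j has no dependence between iterations, so each
    doubling step could be executed in parallel.
    """
    n = len(b)
    mult, add = list(b), list(c)
    steps = 0
    shift = 1
    while shift < n:
        new_mult, new_add = mult[:], add[:]
        for j in range(shift, n):
            # Apply the earlier composite map first, then the current one.
            new_mult[j] = mult[j] * mult[j - shift]
            new_add[j] = mult[j] * add[j - shift] + add[j]
        mult, add = new_mult, new_add
        shift *= 2
        steps += 1
    return [a1] + [m * a1 + d for m, d in zip(mult, add)], steps


if __name__ == "__main__":
    b = [1 if j % 2 else -1 for j in range(1000)]
    c = [j % 7 for j in range(1000)]
    values, steps = parallel_recurrence(3, b, c)
    print(values == serial_recurrence(3, b, c), steps)
```

For 1000 elements the doubling distances are 1, 2, 4, ..., 512, which gives the 10 steps quoted above.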


With the inclusion of the parallel linear recurrence as a function 
in the language, the programmer has two ways of writing his linear 
recurrences. For example, given the serial FORTRAN 

      DO 1 J=1,1000
      DO 1 K=1,1000
      A(J,K+1) = A(J,K) * B(J,K) + C(J,K)
    1 CONTINUE

there are two ways to write it in FMP FORTRAN, given that the order
of nesting the loops is otherwise irrelevant, namely:


Method I:

      DOALL, J=1,1000
      DO 1 K=1,1000
      A(J,K+1) = A(J,K) * B(J,K) + C(J,K)
    1 CONTINUE
      ENDDO

Method II:

      DOALL, K=1,1000
      DO 1, J=1,1000
      A(J,K+1) = RECURRENCE( A(J,K) * B(J,K) + C(J,K))
    1 CONTINUE
      ENDDO

Method I, which executes the recurrence serially in an inner loop,
runs about nine times as fast as Method II, which executes each
one of the recurrences in parallel across each value of J in turn.
That is, Method I is 512 times as fast as a serial machine, while
Method II is 57 times faster than a single serial processor. The
RECURRENCE function is included only for those cases where Method
I is not an available option.
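
The two speed-up figures are consistent with a simple reading, shown
here as a Python sketch: Method I keeps all 512 processors busy on
independent recurrences, while Method II is assumed to spend
log2(512) = 9 parallel steps for each step of the serial recurrence.
That cost model is an interpretation of the figures quoted above, not
a statement taken from the FMP design.

```python
import math

PROCESSORS = 512
steps_per_recurrence_step = int(math.log2(PROCESSORS))   # 9 parallel steps

method_1_speedup = PROCESSORS                             # 512 over a serial machine
method_2_speedup = PROCESSORS / steps_per_recurrence_step # about 57
ratio = method_1_speedup / method_2_speedup               # Method I vs. Method II
```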
CHAPTER FOUR


SIMULATION


4.1 SIMULATION GOALS

The simulation effort during this extension of the feasibility study has two
distinct goals. The first is the requirement of the statement of work for
this extension that a simulation of the FMP be prepared and at least one
simulation run. The second is to get a head start on those simulations needed
for phase II, described in Chapter 6 as the mechanism for settling various
trade-offs. The statement of work also calls for the selection of "metrics",
that is, selected portions of the benchmark programs to be used as inputs to
the simulations to measure the performance of the projected FMP.

Detailed instruction-by-instruction timing of code execution in CU and EU is
necessary to ensure that the required throughput can be achieved. The design
of major system components must be specified in sufficient detail to provide
structure, logic, and timing parameters for system simulation. This
information is in Chapter 2.

Compiler functioning, including FORTRAN extensions for the FMP, is also
needed and is found in Chapter 3. Hand compilation methods must be specified.

In the case of the current extension, a single metric, subroutine TURBDA, has
been selected and hand compiled for this purpose. Further definition of hand
compilation is needed for phase II. In particular, how much compiler
sophistication will be achieved in the first version affects hand compilation,
and this is still a subject for discussion. At this time it is best to make
conservative assumptions, again in order to reduce the element of risk in the
simulated system performance predictions.

The design details and design choices outlined above have nevertheless been
made definite at this time for the first of the detailed simulations, which
are required to establish confidence in the feasibility and throughput
capability of the SAM architecture. Any or all of the details may be changed
as a result of further study or the availability of more advanced components.
Of course, all such changes would be supported by simulation studies to
maintain or increase confidence in the correctness of the system design.

4.2 SELECTION OF METRICS

It is Burroughs' understanding that the final selection of metrics will be the
Government's. Metric selection is a function of the architecture that is to
be measured. For example, in a conventional serial uni-processor, the
distinction between "serial" and "parallel" streams of code is irrelevant,
and should have no bearing. With parallel processors such as the two designs
being proposed for the FMP (NAS2-9456 and NAS2-9457 final reports), the
arrangement of data in memory affects the efficiency of parallelism, and
metrics should be selected such that all "directions" of access of that data
are represented. What is important is that the metrics selected be
"representative", both with respect to the operations being performed by the
target architecture, and the codes that will be run on the FMP. Some
"representative" of every kind of code that the FMP will run is wanted, but
the results should be weighted according to the expected frequency of each
"kind." "Kind" refers to the sort of interaction with the architecture that
is represented: whether parallelism is two-dimensional or one-dimensional,
the direction of accessing, presence or absence of branches in inner loops,
and so on; all the things that may have an effect on the way the selected
architecture behaves.

The metric that has been selected as the one that shall be used in the single
simulation that will be run during the extension of the contract is SUBROUTINE
TURBDA. Like most of both the implicit and explicit codes, it exhibits a
great deal of parallelism, but with some operations conditional on subscript,
so that different things are being done at different subscripts. It thus tests
the architecture's ability to do different things at different grid points. It
includes fetches from, and stores to, the program's data base (in extended
memory), exercising the data transfer paths from the program data base to
the processing resource proper. It contains sufficient arithmetic manipulation
to exercise that aspect of the FMP (although probably less than a "typical"
subroutine). It contains significant amounts of index computation both on
loop controls and on subscripts. For the FMP design of Reference 1, it
exercises the synchronization, which is an essential feature of that design.

4.3 SIMULATION MODELS

The NASF system simulation modeling will be done at three levels of detail, 
with results from a detailed model being used to determine parameter values 
for the next higher level model. 


The most detailed level of modeling is the instruction timing model for CU
and processors. For example, the model for the processor has as resources
the PDM, PPM, instruction registers and decoding, multipliers, adders, data
and index registers, etc., corresponding to the detailed processor design. A
metric for this model is a sample code sequence generated by hand compilation
of a FORTRAN section typical of the Navier-Stokes codes. Each instruction
is modeled by a sequence of tasks, each requiring one or more of the resources,
and executing for the specified number of clocks. Instruction fetch and decode
is such a task sequence, and the extent of overlap with instruction execution
is modeled. Similarly, the extent to which instructions can overlap is modeled
by the use of queueing for resources, or by logic tests, in exact
correspondence with the processor design. The output reports from running this
model can be used to determine parameters for the next level model. An
important performance factor to be determined is the extent to which the
address calculations for EM accesses can be interlaced with, and overlapped
by, the floating point calculations.

The next level of simulation will be the flow model processor, including the
CU, processor, EM, and DBM. The interactions to be measured are the CU and
processor code execution times (previously determined), and data transfers
between EM and DBM. The metric will be a sequence of code executions and data
transfers approximating the main body of computation in a Navier-Stokes code.
The results will show the throughput performance of the FMP, together with
the utilizations of EM and CU, which interface with DBM and the rest of the
system.

When we wrote the simulation model, we found that the instruction level model
needed to include the interaction between CU and EU, combining the first and
second levels. The lowest level simulation model therefore is detailed to the
instruction level, but includes CU, a number of processors, and access and
data transmission timing of the Extended Memory and Transposition Network.
Simulation of a number of selected code sections on this model will provide
the parameters required to model the execution of complete jobs and sequences
of jobs through the Facility.

The overall system model will include the host, File Memory, Data Base
Memory and their interfaces with each other and CU and EM. The metrics
will be presumed scenarios of user requests for NSS jobs. The sequence of
scheduling, initialization, NSS operation, and output will be modeled.
Important functions to be modeled are data base and program transfers from
File Memory to DBM to EM, CUM, and PDM, allocation of DBM space, the
sequence of FMP operations, including data and program input, computation,
snapshot and data outputs, and changeover to the next job. Only the FMP
scheduling and control load on the host will be modeled; the amount of host
capacity available for other necessary work can be measured, or the host can
be loaded to any desired level by undefined "background" jobs and the effect
on NASF throughput measured.

The overall simulation effort will have two functions: first, to support the
validity of the SAM architecture by modeling all essential system functions
and interfaces in sufficient detail and demonstrating proper function of the
model, and second, to show the throughput capability of the system for
aerodynamic simulation jobs by tracing the throughput step-by-step from the
instruction level to the user interface.

Simulations will be written in Burroughs Operational System Simulator (BOSS),
a discrete-events simulator whose input language is the flow-graph of the
process being simulated. The instruction level simulation of Section 4.5 is
written in BOSS; the second and third level simulations of Phase II will also
be written in BOSS. In Phase II, the instruction-level simulator may be
rewritten in ALGOL, since substantial improvement in simulation execution
time is expected.

4.4 BOSS SIMULATOR

The BOSS simulator was used for the simulations because of the relative ease
of modeling with BOSS and the short time available. Special timing simulator
programs for EU and CU code execution probably could have been completed in
three months. Simulations at different levels of detail will be used to get
performance predictions ranging from the EU instruction execution to the user
interface level.
A discrete events simulator, such as BOSS, models the activity of a system
as a definite sequence of states. The model changes state only at discrete
points, called events, which occur at definite instants of time. Every event
can be predicted at the occurrence of some prior event, and the new state of
the system model resulting from each event can be completely determined from
that event and the prior state. In practice the event prediction and state
change calculations are often probabilistic, because the real system is too
complex to be modeled in full detail. The state variables of the model are
mostly binary logic variables such as busy/not busy or happened/not happened,
and processing of an event involves the accessing of state tables and
evaluation of binary decision functions. Arithmetic operations rarely occur
except in the evaluation of continuous probability functions, where they are
used in the binary decisions or in predicting the times of future events.
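
The event mechanism just described can be sketched in a few lines of
Python. This is an illustration of the general discrete-events idea
only, not of BOSS itself; the function run_events, the handler
signature and the 5-clock and 2-clock delays are all hypothetical.
Each event handler updates the state and predicts future events,
which are kept in a time-ordered queue.

```python
import heapq


def run_events(initial_events, handlers):
    """Process (time, name) events in time order; return (log, final_state).

    handlers maps an event name to a function f(time, state) that updates
    the state and returns a list of newly predicted (time, name) events.
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    state, log = {}, []
    while queue:
        time, name = heapq.heappop(queue)      # next event in time order
        log.append((time, name))
        for event in handlers[name](time, state):
            heapq.heappush(queue, event)       # events predicted by this one
    return log, state


def on_start(time, state):
    state["busy"] = True                       # binary state variable
    return [(time + 5, "finish")]              # predict the ending event


def on_finish(time, state):
    state["busy"] = False
    state["done"] = state.get("done", 0) + 1
    # start again two clocks later until three units of work are done
    return [(time + 2, "start")] if state["done"] < 3 else []


log, state = run_events([(0, "start")], {"start": on_start, "finish": on_finish})
```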

The BOSS simulator program runs on a B 6700 or B 7700 Burroughs computer.
It is a general purpose discrete events simulator, with emphasis on ease of
modeling and efficiency in execution, in exchange for some restrictions on the
size and generality of models. BOSS has been used by the Federal and
Special Systems Group at Paoli mainly for simulating the hardware and software
functions of data processing systems, and improvements and enhancements over
several years have made it especially useful for this purpose.
In a BOSS simulation the element of model activity is a TASK. A task is
characterized by its requirement for resources and by the algorithm specified
for predicting its execution time. A task is initiated upon completion of its
predecessor requirement, which is usually a logical combination (AND or OR)
of one or more prior task endings. The task may wait in queue until the
required resources are available; the selected resource units are then made
busy for the execution time. At the task ending event, resources are released,
queues are served, and the predecessor requirements of any successor tasks are
updated. Several kinds of test and branch constructs are available to cause
conditional selection of one out of two or more successor tasks.
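
The task life cycle above (predecessor requirement, queueing for a
resource, execution, release) can be sketched as follows. The Python
code is an illustrative toy and does not reproduce BOSS syntax or
semantics; the function simulate, the dictionary layout, the AND-only
predecessor rule and the sample task names are assumptions.

```python
import heapq
import itertools
from collections import deque


def simulate(tasks, resources):
    """Run a tiny task/resource model; return {task: (start, end)}.

    tasks:     {name: {"needs": resource, "time": clocks, "after": [names]}}
               A task starts only when ALL tasks in "after" have ended.
    resources: {resource: number_of_units}
    """
    free = dict(resources)
    waiting = {r: deque() for r in resources}          # FIFO queue per resource
    pending = {t: len(spec["after"]) for t, spec in tasks.items()}
    successors = {t: [] for t in tasks}
    for name, spec in tasks.items():
        for pred in spec["after"]:
            successors[pred].append(name)

    events, order, times = [], itertools.count(), {}

    def request(name, now):
        res = tasks[name]["needs"]
        if free[res] > 0:                # seize a unit for the execution time
            free[res] -= 1
            end = now + tasks[name]["time"]
            times[name] = (now, end)
            heapq.heappush(events, (end, next(order), name))
        else:                            # otherwise wait in queue
            waiting[res].append(name)

    for name in tasks:                   # starting tasks have no predecessors
        if pending[name] == 0:
            request(name, 0)

    while events:                        # task-ending events in time order
        now, _, name = heapq.heappop(events)
        res = tasks[name]["needs"]
        free[res] += 1                   # release the resource
        if waiting[res]:                 # serve the queue
            request(waiting[res].popleft(), now)
        for succ in successors[name]:    # update successors' requirements
            pending[succ] -= 1
            if pending[succ] == 0:
                request(succ, now)
    return times


tasks = {
    "fetch_a": {"needs": "memory", "time": 3, "after": []},
    "fetch_b": {"needs": "memory", "time": 3, "after": []},
    "add":     {"needs": "adder",  "time": 2, "after": ["fetch_a", "fetch_b"]},
}
schedule = simulate(tasks, {"memory": 1, "adder": 1})
```

With a single memory unit the second fetch queues behind the first,
and the add cannot start until both fetches have ended.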


The direct interaction of tasks is restricted to structures of tasks grouped
together and called PROCESSES. When a process is initiated, one or more
"starting tasks" within it are initiated without predecessors, and the activity
within it passes from task to task until such time as there is no further task
activity, when that active version of the process ends. Except for competition
for resources, and certain special constructs, there is no interaction between
the active tasks in separate active processes.


The static structure of a BOSS model is described by the structures of the
tasks and their interactions within processes and by the numbers and kinds of
resources available. The dynamic state of activity is described by the states
of activity of processes and tasks. Every task is a member of some process,
and there is no activity in the system model until some process is initiated.
Initiation of processes at specified times corresponds to external loads
causing activity in the system. Processes can also be initiated as
subroutines, or by task endings in other processes. Many processes can be
active concurrently, including multiple but distinct and independent versions
of the same process. Similarly, within a process, many tasks may be active in
parallel, including multiple independent versions of the same task. Thus, it
is easy to model a highly parallel system with many concurrent activities,
including cases where many of the parallel activities are very similar in
structure.

The basic BOSS structure described above is sometimes inadequate or
inconvenient for modeling some parts of the system. Therefore, there is
available a superposed structure of local and global variables upon which
arithmetic operations can be performed at task endings. These variables can
be addressed directly or indirectly, and their values can be used to control
branching at task endings or to modify the resource requirement or execution
time of specified tasks. This extension permits a certain amount of
programming of capabilities not available in the basic BOSS structure. In
this way, for example, the activity in one process can be influenced by
actions occurring in another process.

Figure 4-1 shows graphically the process of implementing and debugging 
simulations in BOSS, showing the various steps that the simulation programmer 
and the BOSS simulator go through in achieving the final result. 
4.5 SIMULATION MODEL FOR THE CURRENT STUDY

The overall structure of the model is shown in Figure 4-2. The Control Unit
and Processor models are driven by code files prepared by hand compilation
of a selected FORTRAN code segment. All the operators of CU and EU are
modeled in detail so that any code may be simulated. Additional operators
may be easily added if needed. Conditional branching cannot be modeled in
complete detail since the model is a timing model, and does not simulate
the processing of data. Such branches are therefore modeled by specifying
the number of times one path is taken for each time the other is taken. The
count can be specified probabilistically. For most branches this will do well
enough. The cases where branching depends on the Processor Number will
be handled by a later extension.
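
The count-based treatment of branches can be sketched as follows.
The Python class below is illustrative only; the name CountedBranch,
the path labels and the uniform 15 to 25 range used for the
probabilistic case are assumptions, since the distribution used by
the model is not stated here.

```python
import random


class CountedBranch:
    """Timing-only branch: drop through `count` times, then branch once."""

    def __init__(self, next_count):
        self.next_count = next_count          # callable giving the next count
        self.remaining = next_count()

    def take(self):
        if self.remaining > 0:
            self.remaining -= 1
            return "drop_through"
        self.remaining = self.next_count()    # re-arm for the next pass
        return "branch"


# Fixed count, like a loop test that drops through 20 times and then exits.
fixed = CountedBranch(lambda: 20)
fixed_paths = [fixed.take() for _ in range(42)]

# Probabilistic count: the uniform range is purely illustrative.
rng = random.Random(1)
noisy = CountedBranch(lambda: rng.randint(15, 25))
noisy_paths = [noisy.take() for _ in range(200)]
```

No data values are examined; only the sequence of paths, and hence
the timing, is reproduced.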

The Control Unit model includes its processor, a single memory (CUM), and
seven of the control functions interacting with the processor EU's, as shown.

Any desired number of processors can be modeled, but the number actually
used will be small (4 to 10) to avoid excessive machine time to run the
simulations. Details of instruction overlap in the CU are not modeled;
instruction execution times are not allowed to overlap, but CUM data fetches
or stores can overlap the execution time of prior or following instructions.
A data fetch of one instruction must come after a data store (if any) of the
preceding instruction. In case of contention for CUM by program fetches, the
data accesses have priority, but do not abort program fetches already in
progress. The program look-ahead stack has a capacity of four code segments,
which is two memory words for opcode formats using 24-bit segments.
Each Processor consists of an Execution Unit (EU) and separate program and
data memories (PPM and PDM). The EU is modeled in some detail in order
to properly simulate instruction overlap, as shown in Figure 4-3. The
operation is as follows:

4.5.1 Program Fetch. The Program Counter (PCR) addresses the next
instruction, which is available at PPM three clocks after the address is
available. As soon as a full word of program stack is empty, the next code
word is read to the stack from PPM, and PCR is incremented. When a branch
occurs, the program stack is emptied and the new code word is available three
clocks after the new PCR is set.

4.5.2 Scoreboard. Each instruction records in the scoreboard the times at
which it will release each resource that it will use. The next instruction
must wait in stack until all resources that it will need will be available
when needed. The Scoreboard and Decoding are modeled logically, but not as
resources for which there could be queueing.
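
The scoreboard rule can be sketched as follows in Python. The
function issue, the (resource, offset, duration) description of an
instruction, the one-instruction-per-clock start rule and all clock
counts are assumptions made for illustration; they are not the timing
of the actual EU instruction set.

```python
def issue(scoreboard, uses, earliest):
    """Return the start clock of an instruction and update the scoreboard.

    scoreboard: {resource: clock at which it will be released}
    uses:       [(resource, offset, duration)], meaning the instruction needs
                the resource from start+offset for `duration` clocks
    earliest:   first clock at which the instruction could start
    """
    start = earliest
    for resource, offset, _ in uses:
        # wait until the resource will be free at the moment it is needed
        start = max(start, scoreboard.get(resource, 0) - offset)
    for resource, offset, duration in uses:
        scoreboard[resource] = start + offset + duration   # record release time
    return start


scoreboard = {}
t_fmul = issue(scoreboard, [("FP", 0, 7)], 0)            # floating multiply
t_iadd = issue(scoreboard, [("IP", 0, 2)], t_fmul + 1)   # overlaps the multiply
t_fadd = issue(scoreboard, [("FP", 1, 5)], t_iadd + 1)   # must wait for the FP
```

The integer add starts while the multiply is still running, whereas
the floating add is held until the FP resource will be free at the
clock where it is first needed.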

4.5.3 Holding Registers. If any resource is required at a time later than
instruction start, that instruction must wait in the corresponding Holding
Register. If that Holding Register is tied up by the previous instruction, then
the current instruction must wait, even though it could otherwise start.
Figure 4-3. Execution Unit Model

4.5.4 Integer Processing, Floating Point Processing, PPM (IP, FP, PPM).
These are modeled as resources, although the Scoreboard should assure that
there will be no queueing for them. The utilization of these resources will
give information about the efficiency of overlap and the fraction of elapsed
time that the FP is in use.

4.5.5 Synchronizing Controls. The timing of synchronizing controls is assumed
to take 3 clocks for a round trip from CU to EU and back to CU. This is modeled
as no delay from CU to EU since the control signal arrives at the same time as
the corresponding clock pulse from the central clock. The 3 clocks delay is
then all in the return path from EU to CU. The actions of the Synchronizing
Controls are as follows:

4.5.5.1 READY. The CU raises the READY level at the proper time in
synchronized instructions where the EU's must wait for CU action before
proceeding (LOADEM, STOREM). Any EU which reaches such an instruction before
CU waits for the READY level. CU will wait for (IGH + EN) and then turn off
the READY level.

4.5.5.2 (IGH + EN). This is a level equivalent to a logic function generated
as follows: When an enabled (EN) EU comes to the proper point in a synchronized
instruction it raises the output line corresponding to I Got Here (IGH). This
same level is raised all the time an EU is disabled (EN), hence (IGH + EN).
IGH is turned off by GO from CU. The (IGH + EN) lines for all EU's are ANDed
at the CU to produce its (IGH + EN) input. In the model this logic function is
performed by maintaining separate counts of the number of EU's enabled
(#EN) and in the I Got Here state (#IGH); (IGH + EN) is true when #EN = #IGH.
4.5.5.3 EN. EN is true when #EN = 0 (no EU's are enabled).

4.5.5.4 GO. When (IGH + EN) becomes true at the CU, it raises the GO
level for one clock. All enabled EU's, on receipt of this signal, turn off the
IGH level and continue the instruction in which they were waiting.

4.5.5.5 Wait GO. When CU sends this signal (one clock), all enabled EU's
enter the IGH state (waiting for GO) in place of the next instruction start.
The current instruction is or will be finished.

4.5.5.6 Disable. When CU sends this signal (one clock), all enabled EU's
enter the disabled (EN) state in place of the next instruction start. The
current instruction is or will be finished.
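
The counting scheme used for the synchronizing controls can be
sketched as follows. The Python class is a hypothetical illustration:
the class and method names are invented here, signal delays are
ignored, and disabling a single EU by number is an assumption made to
keep the example short.

```python
class SyncModel:
    """Count-based model of the (IGH + EN) and GO logic for n_eu EU's."""

    def __init__(self, n_eu):
        self.enabled = [True] * n_eu     # enabled state of each EU
        self.igh = [False] * n_eu        # I Got Here state of each EU

    def num_enabled(self):               # "#EN" in the text
        return sum(self.enabled)

    def num_igh(self):                   # "#IGH": enabled EU's at the sync point
        return sum(1 for en, here in zip(self.enabled, self.igh) if en and here)

    def all_here(self):
        # ANDed (IGH + EN) input at the CU: true when #EN equals #IGH
        return self.num_enabled() == self.num_igh()

    def none_enabled(self):              # true when #EN = 0
        return self.num_enabled() == 0

    def reach_sync_point(self, eu):
        if self.enabled[eu]:
            self.igh[eu] = True

    def disable(self, eu):
        self.enabled[eu] = False
        self.igh[eu] = False

    def go(self):
        """CU raises GO only when all_here(); enabled EU's then drop IGH."""
        if not self.all_here():
            return False
        self.igh = [False] * len(self.igh)
        return True


sync = SyncModel(4)
sync.disable(3)                          # EU 3 takes no part in this sync
history = []
for eu in (0, 1, 2):
    sync.reach_sync_point(eu)
    history.append(sync.all_here())
released = sync.go()
```

A disabled EU never holds up the others, because it is counted in
neither #EN nor #IGH.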

4.5.6 Extended Memory and Transposition Network. The EM and TN are not
modeled as resources that may be busy; thus it is assumed that during execution
of CU-EU code, the EM is never in use for DBM transfers. The EM access
time and data transmission time through TN are properly modeled in the
execution time of the LOADEM and STOREM instructions.

4.5.7 Code Simulated. The hand-compiled TURBDA assembly codes are
given in Tables 4-1 and 4-2, together with an assembly coded SQRT, which
is a simplified version omitting the tests and branches for negative argument
and for negative exponent.

4.5.7.1 Processor Code. The large amount of integer computation at the
beginning of each pass through the TURBDA loop would give a low utilization
of the Floating Point unit, were it not for the large block of FP calculation in

Table 4-1. TURBDA Processor Code Simulated by Model

Label   Operator        Remarks

                        (ICALL not simulated)
        FL
        FDIVM
        IL
        IL
L3      ITIX, L4        (Drop through 2 times, then exit to L4)
        ISHL
        IPNO
        IADD
        IDIVL
        ISTORE
        IMULL
        IFETCH
        IL
L14     ITIX, L1        (Drop through 20 times, then exit to L1)
        IADDL
        ID521
        LOADEM
        IADDL
        ID521
        IL
        IFETCH
        IEQL            (No Branch)
        IL
L100    LOADEM
        IADDL
        ID521
        IL
        IFETCH
        IEQL            (No Branch)
        IL
L200    LOADEM
        IFETCH
        IGT             (No Branch)
        IFETCH
        IFETCH
        IGT             (No Branch)
        IEQL            (No Branch)
        IEQL, L20       (Jump to L20)
        FFETCH          (Never Executed)
        FABS            (Never Executed)
        FMUL            (Never Executed)
        FSTORE          (Never Executed)
        IJUMP, L40      (Never Executed)
L20     FFETCH
        FFETCH
        FADD
        FABS
        FMULL
        FMUL
        FSTORE
        JUMP, L40       (Jump to L40)
L30     FFETCH          (Never Executed)
        FFETCH          (Never Executed)
        FADD            (Never Executed)
        FABS            (Never Executed)
        FMULL           (Never Executed)
        FMUL            (Never Executed)
        FSTORE          (Never Executed)
L40     FL
        FFETCH
        FMUL
        ENTER SQRT
        FMUL
        FL
        FADD
        FDIV
        IADDM
        ID521
        STOREM
        JUMP L14        (Jump to L14)
L1      JUMP L3         (Jump to L3)
L4      STOP

SQRT    IUPK3
        IADDL
        IANDL
        ISHL
        ISUB
        IADDL
        IANDL
        ISUB
        ISHL
        IADD
        IADDL
        IPAK3
        FADD
        FNEG
        FL
        FMUL
        FMAD
        FMUL
        FMUL
        FMAD
        FMUL
        FMUL
        FMAD
        FMUL
        FMUL
        FMAD
        FMUL
        FNEG
        FMUL
        IRETURN

the SQRT routine, which is called once per loop. ICALL and IRETURN are
both estimated at 23 clocks, which may be pessimistic and considerably
reduces the FP utilization of SQRT. In an inner loop such as this, SQRT
should probably be written in-line, since it will occupy no more than 20-30
words, and about 50 clocks are saved.

Note that the outer loop, starting at L3, is executed twice, and each time
the inner loop, starting at L14, is executed 20 times. This is a sufficiently
large sample of code execution to give valid statistics. Within the inner
loop, in the actual code, each EU will execute one of three branches,
depending on the index states. In the simulation, only the branch starting at
L20 (the longest of the three) is executed. The other two are never executed,
as indicated.

In the actual code, two of the LOADEM's are conditional (LOADEMC). However,
only the EM address and EM data input are conditional, the timing being the
same, so the simulator makes no distinction.

4.5.7.2 Control Unit Code. The Control Unit code of Table 4-2 begins with
LOOP, because the model starts with all EU's waiting for GO. When (IGH + EN)
is true, LOOP causes both CU and EU's to branch to the addresses specified by
the LOOP instruction, and this is a convenient way to get the simulator to
jump to the desired addresses in the simulated code files.
Table 4-2. TURBDA Control Unit Code Simulated

Label   Operator        Remarks

        LOOP
        CL
        CL
L3      CTIX, L4        (Drop through 2 times, then Branch L4)
        CSHFN
        CMULL
        CFETCH
        CL
L14     CTIX, L1        (Drop through 20 times, then Branch L1)
        CADDL
        CADD
        CMD521
        CL
        LOADEM
        CADDL
        CADD
        CMD521
        LOADEM
        CADDL
        CADD
        CMD521
        LOADEM
        CADDL
        CADD
        CMD521
        STOREM
        CJUMP, L14      (Jump to L14)
L1      CJUMP, L3       (Jump to L3)
L4      CRETURN
        END SIMULATION



The only synchronization instructions in this code sample (aside from LOOP)
are the three LOADEM's and the STOREM.

The CU and its synchronizing action are simulated in some detail to determine
two things:

(1) How much do processors wait at sync points for other processors
    to catch up?

(2) Do processors ever wait at sync points for CU to catch up, and
    if so, how much?

4.6 SIMULATION RESULTS

The simulation runs were made with a model having the Control Unit and
four processors. The code driving the model was the TURBDA code shown
in Tables 4-1 and 4-2, except that the outer loop was reduced to one iteration
and the inner loop to 10, in order to reduce machine time for these first trial
runs. Under these conditions the simulation indicates that the abbreviated
TURBDA runs 4600 clocks, or 184 microseconds assuming a 25-megahertz clock.
The full size TURBDA with two iterations in the outer loop and 31 in the inner
loop would run about six times as long, or 1100 microseconds (27,600 clocks).
The parallelism is 31x31 = 961, compared with 1024 possible in two iterations;
so the efficiency of array use is 93.8 percent in this case.


In the simulated TURBDA run, each processor performs 281 floating point
operations lasting a total of 2407 clocks, for an average of 8.6 clocks per
FLOP. The elapsed time of 4600 clocks yields an effective throughput of 1.53
MFLOPS per processor. The array throughput would then be 782 MFLOPS, or 733 at
93.8 percent array efficiency for the 31x31x31 problem. As expected for
TURBDA, these rates are considerably lower than 1000 MFLOPS. This
reduced throughput has three causes:

(1) There are 40 EM accesses with the 281 floating point ops, or a
    ratio of only 7 to 1. The EM accesses themselves do not cause
    appreciable delay, but the integer operations required to calculate
    the EM addresses do cause delay.

(2) The floating point operations of TURBDA contain more than the
    normal proportion of multiplies and divides, raising the average
    duration from the nominal 7.3 clocks to 8.6 clocks per floating
    point operation.

(3) The function SQRT was simulated as a subroutine, with entry
    and return operators. It is likely that the compiler will put
    simple functions like SQRT in line. If so, the total time would
    be only nine tenths that shown, for an 11 percent increase in
    measured throughput.
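
The throughput figures quoted in this section follow from a few lines
of arithmetic, reproduced here as a Python sketch. All inputs are the
numbers given in the text; only the variable names are new, and the
512-processor array size is taken from Chapter 3.

```python
CLOCK_MHZ = 25                 # 25-megahertz clock
PROCESSORS = 512

clocks = 4600                  # abbreviated TURBDA run
flops_per_processor = 281
fp_clocks = 2407

elapsed_us = clocks / CLOCK_MHZ                            # 184 microseconds
clocks_per_flop = fp_clocks / flops_per_processor          # about 8.6
mflops_per_processor = flops_per_processor / elapsed_us    # about 1.53
array_mflops = mflops_per_processor * PROCESSORS           # about 782
array_efficiency = (31 * 31) / 1024                        # about 0.938
effective_mflops = array_mflops * array_efficiency         # about 733
inline_sqrt_gain = 1 / 0.9 - 1                             # about 11 percent
```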

Some other conclusions of interest are:

(1) Control Unit processing causes essentially no delay (less than
    0.5 percent of the total time).

(2) Extended memory accesses occupy 11.5 percent of the time,
    including all synchronizing delays.

(3) Program fetches cause little or no delay. The model does not
    measure such delays exactly, and should be modified to do so.
    Program memory is in use 42 percent.

(4) The utilization of the integer unit is 47 percent, data memory
    10 percent and floating point unit 58 percent, for a total of
    115 percent, indicating the approximate degree of overlap.

(5) The inner loop takes 450 clocks, of which 197 are in the SQRT
    routine. Two thirds of the floating point operations are in the
    SQRT routine.

Figure 4-4 is an example of one of the output tables of one of the simulation 
runs. The unit types represent various system resources as indicated by 
the row headings typed in on the left. In some cases the resource is used for 
internal control purposes in the model and does not represent a real system 
component, so is unlabelled. Some of the resources represent logic levels 
and signals such as READY, GO, IGH+EN, EN=0. A processor or CU waiting 
for such a level or signal is modeled as queueing for the resource, which is 
created to represent the presence of the level or signal. 
UNIT UTILIZATION STATISTICS

                                            PERCENT OF ACTIVE TIME
                 UNIT   UNIT ID   TOTAL    ...IN USE...    ..DEPENDENT..    FREE
                 TYPE   NUMB      TIMES
                                  USED     DELTA   TOTAL   DELTA   TOTAL    TOTAL

CUM               24       3       173     11.27   11.27    0.00    0.00    88.73
CUPROC            25       4       233     98.89   98.89    0.00    0.00     1.11
CPSTAK            20       5       171     11.86   11.86    0.00    0.00    88.14
  "               20       6       173     11.40   11.40    0.00    0.00    88.60
  "               20       7       171     11.92   11.92    0.00    0.00    88.08
  "               19       8       170     11.62   11.62    0.00    0.00    88.38
                  23       9         1      0.07    0.07    0.00    0.00    99.93
                  28      10       186      0.91    0.91    0.00    0.00    99.09
READY             15      11       240      0.87    0.87    0.00    0.00    99.13
GO                17      12       246      0.00    0.00    0.00    0.00   100.00
(IGH+EN)          21      13        82      0.00    0.00    0.00    0.00   100.00
#EN=0             26      14         1      0.00    0.00    0.00    0.00   100.00
IU                 3      15       506     47.01   47.01    0.00    0.00    52.99
FPU                4      16       364     57.74   57.74    0.00    0.00    42.26
PDM                5      17       154      9.97    9.97    0.00    0.00    90.03
PPM                7      18      1346     42.37   42.37    0.00    0.00    57.63
HOLD REGISTERS     8      19         1      0.00    0.00    0.00    0.00   100.00
  "                9      20        53      9.60    9.60    0.00    0.00    90.40
  "               10      21        52     10.18   10.18    0.00    0.00    89.82
QUEUED            13      22       558     77.55   77.55    0.00    0.00    22.45
PPSTAK            12      23       862     35.50   35.50    0.00    0.00    64.50
  "               12      24       862     35.53   35.53    0.00    0.00    64.47
  "               12      25       860     40.69   40.69    0.00    0.00    59.31

Figure 4-4. Sample of Simulation Output



CHAPTER FIVE 
RELIABILITY 


5.1 INTRODUCTION 

This chapter presents two major aspects of the NASF reliability
and trustworthiness: (1) an availability prediction of the FMP and
(2) further development of the error detection and correction
techniques for the various FMP elements. These topics are covered
in sections 2 and 3 of this chapter, respectively.

The system availability design goal for the B7800 host system and
the Flow Model Processor (FMP) is 90 percent or better. Also, it
is desired that the probability of success for completing runs of
ten minutes and one hour be equal to or greater than 98 percent
and 90 percent, respectively. The following is the conventional
formula for computing availability:


$$A = \frac{\mathrm{MUT}}{\mathrm{MUT} + \mathrm{MDT}}$$

where,

A   = Availability
MUT = Mean Up Time
MDT = Mean Down Time.


Up time is the duration during which the system is continuously
up. Down time is the interval between up times. It can be seen
that a system MUT of 9 hours or longer combined with a system MDT
of 1 hour or less satisfies the availability goal. These values also
satisfy the desired reliability, or probability of success, as
evidenced by the following formula

    R(t) = e^(-t/SMUT)

where, 

R(t) = The probability of successfully completing a run as 
a function of t 

t = Duration of the run {hours) 

SMUT = System Mean Up Time (hours) 
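
The two formulas above can be evaluated directly. The following is a minimal sketch, not part of the original analysis; the function names are illustrative assumptions. At MUT = 9 hours and MDT = 1 hour it gives A = 0.90, and at SMUT = 9 hours it gives R of about 0.982 for a ten-minute run and about 0.895 for a one-hour run.

```python
import math


def availability(mut_hours: float, mdt_hours: float) -> float:
    """A = MUT / (MUT + MDT), with both times in hours."""
    return mut_hours / (mut_hours + mdt_hours)


def run_success_probability(run_hours: float, smut_hours: float) -> float:
    """R(t) = exp(-t / SMUT): probability of completing a run of length t."""
    return math.exp(-run_hours / smut_hours)


if __name__ == "__main__":
    # Values quoted in the text: MUT = 9 hours, MDT = 1 hour.
    print(f"A           = {availability(9.0, 1.0):.4f}")
    print(f"R(10 min)   = {run_success_probability(10.0 / 60.0, 9.0):.4f}")
    print(f"R(1 hour)   = {run_success_probability(1.0, 9.0):.4f}")
```

Because R(t) depends only on the ratio t/SMUT, a longer SMUT improves both run lengths at once.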

5.2 AVAILABILITY PREDICTION 

The following methods were employed in preparing the FMP avail- 
ability predictions discussed below. 

- Standard component part failure rates were predicted using
the reliability stress analysis prediction method of
MIL-HDBK-217B.

- Potential improvements in reliability through the use of
Single Bit Error Detection and Correction and Double Bit
Error Detection (SECDED) in the FMP memories, fanout tree,
and transposition network were analyzed using a mathematical
model developed specifically for the proposed design of these
elements.

- System Reliability, Availability, and Maintainability (RAM)
characteristics were analyzed using Program DESIGN, which was
developed by the Burroughs Corporation to aid in designing
fault-tolerant computer systems.

MIL-HDBK-217B is used extensively throughout the electronics 
industry to predict the failure rates of electronic component 
parts. Since the prediction methods of MIL-HDBK-217B are quite
detailed and documentation describing these methods is readily
available, only the general aspects of component part failure rate 
predictions are discussed in this report. 



Appendix B contains a description of the SECDED mathematical 
model, including the underlying assumptions associated with the 
development of this technique. A similar description of the
mathematical model employed in Program DESIGN is in preparation.

5.2.1 OVERVIEW 

The proposed Flow Model Processor (FMP) design will be implemented 
using state-of-the-art technology of today and currently proposed 
state-of-the-art technology for the time frame during which 
manufacturing of the FMP will be initiated. Obviously, accurate 
reliability projections for some of the LSI component parts re- 
quired to implement the proposed machine are difficult at this 
point in time. Likewise, projections with respect to gains in
reliability through the use of techniques such as Single Bit Error
Detection and Correction and Double Bit Error Detection (SECDED)
can only be hypothesized based on assumed failure modes until the
design is completed, built, and tested.

Recognizing that the above and additional considerations must be 
seriously addressed to ensure meeting the specified system 
availability requirements of 90 percent, an analysis has been 
conducted to bound the potential availability of the current FMP 
design. Both optimistic and conservative points of view have been
considered for those conditions which cannot be accurately
projected at this point in time. In addition, sensitivity
analyses have been conducted within the upper and lower projected
availability bounds to determine where design attention must be
concentrated in order to achieve the stated availability
requirement and reap the greatest reliability and availability
gains for the effort expended.



The results of this preliminary availability analysis serve two
purposes. First, the analysis shows specific failure, recovery,
and repair time reliability and maintainability estimates at the
subsystem, module, and component part levels that are consistent
with overall system availability of 90 percent and MTBF of 9 hours
or better. Second, the analysis numerically bounds achievable
Mean-Up-Time (MUT), Mean-Down-Time (MDT) and Availability
estimates within the broad range of reasonably optimistic and
pessimistic assumptions.

The following section summarizes the results of this preliminary
availability analysis and the rationale for the assumptions made.
As the FMP design progresses, the availability analysis will be
iterated to further refine specific reliability and maintainability
estimates and to narrow the bounds of uncertainty associated with
these preliminary projections.

5.2.2 Summary of Results 

The first step in this analysis was to develop an overall
availability block diagram of the FMP (Figure 5-1). The estimated
parts counts for all major elements, considering the types of
component parts currently envisioned, were then prepared. For
standard component parts, failure rates were predicted using the
reliability stress analysis prediction method of MIL-HDBK-217B.
Consideration was then given to the failure rates of the large
memory packages (16K, 64K, 256K) of the future. It was hypothesized
that the best that could be expected in terms of reliability is
achieving failure rates equivalent to those achievable today for
4K memory packages (approximately 0.1 Failures Per Million Hours
(FPMH)). The worst reliability that one could expect to encounter
was judged to be equivalent to the series failure rate build-up
for the number of 4K parts required to make up the larger memory
packages, i.e., for 16K: 0.4 FPMH, for 64K: 1.6 FPMH, and for
256K: 6.4 FPMH. Using these component part failure rates for
each of the major elements provided the upper and lower bounds
with respect to projected device reliability.

526 DATA BASE MEMORY MODULES

* The notation M/N means that, of the N identical elements in
the system, M must be operating for the system as a whole to
be operating.


Figure 5-1. Availability Block Diagram of the FMP 
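
The M/N notation of Figure 5-1 corresponds to the textbook "M-out-of-N" availability calculation. The sketch below is an illustration only: it assumes independent, identical elements with a fixed per-element availability, and it is not the Program DESIGN model, which also accounts for recovery efficiency and repair times. The function name and the sample value 0.99 are assumptions.

```python
from math import comb


def m_of_n_availability(m: int, n: int, element_availability: float) -> float:
    """Probability that at least m of n independent, identical elements are up."""
    a = element_availability
    return sum(comb(n, k) * a**k * (1.0 - a) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    # 129/129: every processor in a cabinet is needed.
    print(m_of_n_availability(129, 129, 0.99))
    # 128/129: one on-line spare per cabinet.
    print(m_of_n_availability(128, 129, 0.99))
```

Comparing the two printed values shows how a single spare raises the probability that a cabinet has enough working elements.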
Next, a mathematical model was developed to study the potential
improvements from SECDED. Using this model, it was found that
gains could vary from a lower bound factor of 2 to upper bound
factors of 164 for 16K, 327 for 64K, and 653 for 256K memory
packages.

Finally, redundancy was considered. In this case, the ability to 
automatically detect, isolate, and decommit failed elements 
without noticeable interruption was investigated. As an upper 
bound on reliability, perfect recovery was considered. The lower 
bound was established for a situation where no recovery without 
interruption could be achieved. In this portion of the analysis, 
both permanent failures, which require a repair action, and
intermittent failures, which require only a recovery action,
were factored into the computations.

Using the previously discussed upper and lower bound values, it
was determined that the design potential availability for the
currently proposed FMP is:

* Upper Bound: A_FMP = 0.9995 (see Figure 5-2)

* Lower Bound: A_FMP = 0.9554 (see Figure 5-3)

Both these optimistic and conservative forecasts indicate a high
degree of confidence in the ability of the proposed design to meet
the overall system availability requirement of 90 percent. Using
the above upper and lower bound availabilities for the FMP, it can
be shown that the required availability of the B7800 host system
to meet the 90 percent system availability is:

* A_B7800 = .9004 for the Upper Bound FMP Requirement

* A_B7800 = .9420 for the Lower Bound FMP Requirement
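
The required host availabilities can be reproduced approximately under one assumption that the text does not state explicitly: that the host and the FMP fail independently and the system is up only when both are up, so that the system availability is the product of the two. The sketch below divides the 90 percent goal by each FMP bound; the function name is illustrative.

```python
def required_host_availability(system_goal: float, fmp_availability: float) -> float:
    """Host availability needed so that host * FMP meets the system goal.

    Assumes a series (both-must-be-up) combination of independent units.
    """
    if not 0.0 < fmp_availability <= 1.0:
        raise ValueError("FMP availability must be in (0, 1]")
    return system_goal / fmp_availability


if __name__ == "__main__":
    for label, a_fmp in (("upper bound", 0.9995), ("lower bound", 0.9554)):
        print(f"FMP {label}: host must reach {required_host_availability(0.90, a_fmp):.4f}")
```

The results agree with the quoted .9004 and .9420 to within rounding of the inputs.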



The above required availability values for the B7800 host system
are currently being exceeded by Burroughs B7700 systems operating
in the field today. Since the B7800 system is expected to be even
more reliable and maintainable than currently available B7700
systems, the overall system availability requirement for the FMP
and the B7800 host system appears to be reasonable and achievable.

The data used to obtain these results are presented and discussed
in the following sections.


5.2.3 THE BOUNDS OF FMP AVAILABILITY

This section shows the bounds of the failure rates of all packages
and subsystems. The bounds of MUT (Mean-Up-Time), MDT
(Mean-Down-Time) and availability of the FMP are the highlights.
The failure rate of the system is significantly reduced with
judicious design and the following factors:

1. A ground-based benign environment, where there is nearly
zero environmental stress, with optimum engineering
operation and maintenance

2. Use of high quality parts; MIL-M-38510, class B level
commercial parts are strongly suggested

3. On-line processor spares

4. Error correction techniques, including SECDED

5. Adequate maintainability, as reflected in time to repair

5.2.3.1 PACKAGE FAILURE RATES


The circuit packages are the basic elements in the FMP and,
accordingly, the reliability of the FMP is a function of the failure
rates of these packages. As mentioned in the previous section,
the failure rates of digital circuit packages are predicted with
the guidelines of MIL-HDBK-217B. Table 5-1 shows the predicted
failure rates and the operating environmental conditions of the
control or logic packages used in the FMP. For the memory
packages, the lower bound of the failure rate is 0.1 FPMH. The
assumed upper bound of the failure rate of an m-bit memory package
(m > 4,000), denoted as λ_m, may be computed with the following
formula, which represents the failure rate of the same memory built of
4K-bit parts.


    λ_m = m × (upper-bound failure rate for 4K memory) / (4K bits)  FPMH

        = m × 0.1 / 4,000 FPMH = m × 2.5 × 10^-5 FPMH

Table 5-2 shows the upper bounds of the failure rates of a variety
of memory packages.
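
As a quick check of the upper-bound formula, the sketch below evaluates it for the three package sizes, using the bit counts 16,000, 64,000 and 256,000 that appear in Table 5-2. The constant and function names are illustrative assumptions.

```python
BITS_PER_4K_PART = 4_000      # the text treats "4K" as 4,000 bits
FR_4K_PART_FPMH = 0.1         # failures per million hours for a 4K package


def memory_fr_upper_bound(bits: int) -> float:
    """Series build-up: an m-bit package is charged as m/4,000 separate 4K parts."""
    return bits * FR_4K_PART_FPMH / BITS_PER_4K_PART


if __name__ == "__main__":
    for bits in (16_000, 64_000, 256_000):
        print(f"{bits:>7} bits: {memory_fr_upper_bound(bits):.1f} FPMH")
```

The three results correspond to the 0.4, 1.6 and 6.4 FPMH figures quoted for the 16K, 64K and 256K packages.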

5.2.3.2 THE FAILURE RATES AND MTBF OF SUBSYSTEMS

A subsystem contains the packages listed in Tables 5-1 and 5-2.

The failure rates of the subsystems of the FMP are predicted by
the parts count method. The memory subsystem failure rates are
modified by the SECDED reliability improvement factor, which is
defined as the ratio of the subsystem MTBF with SECDED to that
without SECDED. The factor is discussed in detail in Appendix B.
It can vary from two to six hundred and more, depending on the
size of the memory package. Table 5-3 presents the list of the
packages, the failure rates and MTBF of the control or data
processing subsystems. Tables 5-4 and 5-5 show the bounds of the
failure rates and MTBFs of the memory subsystems. The upper
(lower) bounds of the failure rates (MTBFs) are predicted with
the SECDED reliability improvement factor of two and the failure
rates of the memory packages at their upper bounds. The lower
(upper) bounds of the failure rates (MTBFs) are generated when
the SECDED improvement factors are at their upper limits and the
failure rates of the memory packages are at their lower bounds.

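
The parts count method amounts to summing quantity times individual failure rate over the packages of a subsystem and inverting the total to obtain an MTBF. The sketch below is an illustration under that reading; it reproduces the PEM entry of Table 5-5 (15 control packages at 0.00622 FPMH and 55 16K memory packages at 0.4 FPMH reduced by a SECDED factor of two). The function names are assumptions, and applying the SECDED factor as a simple divisor on the memory failure rate is the interpretation suggested by the tables rather than a statement of the Appendix B model.

```python
def subsystem_failure_rate(parts):
    """Sum of quantity * individual failure rate (FPMH) over (quantity, rate) pairs."""
    return sum(quantity * rate for quantity, rate in parts)


def mtbf_hours(failure_rate_fpmh: float) -> float:
    """MTBF in hours for a failure rate given in failures per million hours."""
    return 1.0e6 / failure_rate_fpmh


if __name__ == "__main__":
    secded_factor = 2.0                      # lower-bound improvement factor
    pem_parts = [
        (15, 0.00622),                       # ECL CONTROL-SSI-I
        (55, 0.4 / secded_factor),           # MOS 16K RAM, upper-bound rate with SECDED
    ]
    total = subsystem_failure_rate(pem_parts)
    print(f"PEM: {total:.4f} FPMH, MTBF = {mtbf_hours(total):.2f} hours")
```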
Table 5-1. The Predicted Failure Rates of the Control or Logic Packages


PART NUMBER  PART DESCRIPTION       TYPE  G/T/B  PINS  TEMP  ENV  QUAL  QUANT  INDIVIDUAL FR
1000 0001    ECL CONTROL-SSI-I      DIG       4    16    45  GB   B         1        0.00622
1000 0002    ECL CONTROL-SSI-II     DIG       6    16    45  GB   B         1        0.00778
1000 0003    ECL CONTROL-SSI-III    DIG      15    16    45  GB   B         1        0.01307
1000 0004    ECL CONTROL-SSI-IV     DIG      22    16    45  GB   B         1        0.01633
1000 0005    ECL CONTROL-MSI        DIG      40    16    60  GB   B         1        0.06082
1000 0006    ECL CONTROL-LSI        DIG     130    16    60  GB   B         1        0.13000


Table 5-2. The Upper Bounds of the Failure Rates of Memory Packages

PART NUMBER  PART DESCRIPTION       TYPE   G/T/B  PINS  TEMP  ENV  QUAL  QUANT  INDIVIDUAL FR
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B         1        0.40000
2000 0002    MOS 64K RAM            RAM    64000    22    60  GB   B         1        1.60000
2000 0003    MOS 256K RAM           RAM   256000    22    60  GB   B         1        6.40000



Table 5-3. The Predicted Failure Rates and MTBFs of the Control Subsystems


PART NUMBER  PART DESCRIPTION       TYPE  G/T/B  PINS  TEMP  ENV  QUAL  QUANT  INDIVIDUAL FR   TOTAL FR

LEVEL 1 DESIGNATION: PE
1000 0006    ECL CONTROL-LSI        DIG     130    16    60  GB   B       100        0.13000   12.99982
MTBF = 76924.16        12.9998 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: CU
1000 0001    ECL CONTROL-SSI-I      DIG       4    16    45  GB   B      2000        0.00622   12.44043
1000 0005    ECL CONTROL-MSI        DIG      40    16    60  GB   B      1000        0.06082   60.82423
MTBF = 13649.15        73.2647 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: FOT
1000 0002    ECL CONTROL-SSI-II     DIG       6    16    45  GB   B      2000        0.00778   15.55517
MTBF = 64287.32        15.5552 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: TN
1000 0004    ECL CONTROL-SSI-IV     DIG      22    16    45  GB   B     10480        0.01633  171.12779
MTBF = 5843.59         171.1278 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: TNC
1000 0001    ECL CONTROL-SSI-I      DIG       4    16    45  GB   B       500        0.00622    3.11011
MTBF = 321532.40       3.1101 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: EM-C
1000 0003    ECL CONTROL-SSI-III    DIG      15    16    45  GB   B        30        0.01307    0.39214
MTBF = 2550138.28      0.3921 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: DBM-C
1000 0003    ECL CONTROL-SSI-III    DIG      15    16    45  GB   B      1000        0.01307   13.07119
MTBF = 76504.15        13.0712 FAILURES PER MILLION HOURS


Table 5-4. The Lower (Upper) Bounds of the Failure Rates (MTBF) of the
Memory Subsystems


PART NUMBER  PART DESCRIPTION       TYPE   G/T/B  PINS  TEMP  ENV  QUAL  QUANT  INDIVIDUAL FR   TOTAL FR

LEVEL 1 DESIGNATION: PEM
1000 0001    ECL CONTROL-SSI-I      DIG        4    16    45  GB   B        15        0.00622    0.09330
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        55        0.00061    0.03353
MTBF = 7884153.75      0.1268 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: PEPM
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        28        0.00061    0.01707
1000 0001    ECL CONTROL-SSI-I      DIG        4    16    45  GB   B        15        0.00622    0.09330
MTBF = 9060039.53      0.1104 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: CUM
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        55        0.00061    0.03353
MTBF = 29820925.34     0.0335 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: EM-M
2000 0002    MOS 64K RAM            RAM    64000    22    60  GB   B        55        0.00030    0.01676
MTBF = (illegible)     0.0168 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: DBM-M
2000 0003    MOS 256K RAM           RAM   256000    22    60  GB   B        55        0.00015    0.00842
MTBF = (illegible)     0.0084 FAILURES PER MILLION HOURS


Table 5-5. The Upper (Lower) Bounds of the Failure Rates (MTBF) of the
Memory Subsystems


PART NUMBER  PART DESCRIPTION       TYPE   G/T/B  PINS  TEMP  ENV  QUAL  QUANT  INDIVIDUAL FR   TOTAL FR

LEVEL 1 DESIGNATION: PEM
1000 0001    ECL CONTROL-SSI-I      DIG        4    16    45  GB   B        15        0.00622    0.09330
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        55        0.20000   11.00000
MTBF = 90144.48        11.0933 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: PEPM
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        28        0.20000    5.60000
1000 0001    ECL CONTROL-SSI-I      DIG        4    16    45  GB   B        15        0.00622    0.09330
MTBF = 175644.96       5.6933 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: CUM
2000 0001    MOS 16K RAM            RAM    16000    22    60  GB   B        55        0.20000   11.00000
MTBF = 90909.09        11.0000 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: EM-M
2000 0002    MOS 64K RAM            RAM    64000    22    60  GB   B        55        0.80000   44.00000
MTBF = 22727.27        44.0000 FAILURES PER MILLION HOURS

LEVEL 1 DESIGNATION: DBM-M
2000 0003    MOS 256K RAM           RAM   256000    22    60  GB   B        55        3.20000  176.00000
MTBF = 5681.82         176.0000 FAILURES PER MILLION HOURS



The legends of these and the following tables are defined as follows:

TYPE - Integrated circuit type

G/T/B - Number of gates, or of transistors, or of bits

TEMP - Junction temperature predicted with MIL-HDBK-217B

ENV - Environment (GB - ground-based benign or standard office
environment)

QUAL - Quality/screening level (B - MIL-M-38510, class B)

QUANT - Quantity (not listed in Table 5-1 or 5-2)

INDIVIDUAL FR - Individual failure rate (per million hours)

Some of the other terminology in these and the following tables and
figures is as follows. Mnemonics representing elements of the FMP
are the same as those shown in Figure 5-1, such as "FOT" for
"fanout tree" or "TNC" for the "control portion of the
transposition network". "MRT" has been used for "mean down time"; the
programmer was thinking that all down time was repair time. "RE",
recovery efficiency, is the fraction of the time that a retry is
successful. For example, for a single bit failure in memory
covered by SECDED, RE is 1.000. For a catastrophic "single point"
failure, RE is 0.000. "Single point" identifies those portions of
the system where a failure at a single point disables the system.

5.2.3.3 AVAILABILITY OF THE FMP

The major task of this section is to assess the bounds of MUT, MDT,
and availability using the program DESIGN. Using the program we can
thoroughly investigate critical factors pertinent to the failure,
repair, and recovery processes. As required, the following
determinants of system interruption and downtime have been
included:



* Permanent and Intermittent Hardware Failure and Repair
Rates

* System Automatic Recovery Features

* System Manual Recovery Rates

Sufficient data have been collected to design new systems
successfully. With these data and all the information from the
previous sections, the program provides an output with all salient
input data and analytical results. The computer printouts use
designations matching those on the block diagram of Figure 5-1.
Corresponding to Table 5-4, Figure 5-2 shows a printout which
indicates that the upper bounds of MUT and availability of the FMP
are 1,032 hours and .9995, respectively, with an MDT of 0.43 hours,
when the MTBF of the hard failure is the same as the MTBF of the
intermittent failure. Similarly, corresponding to Table 5-5,
Figure 5-3 presents an output which shows that the lower bounds of
MUT and availability are 3.5 hours and .9554, respectively, when
the MTBF of the hard failure is ten times that of the intermittent
failure.

5.2.3.4 SENSITIVITY ANALYSIS


Since some factors shown in the previous sections are uncertain,
and the failure rates of the memory packages are unknown, a
sensitivity analysis has been made to study how those factors
affect MUT, MDT, and availability of the FMP. Here we perform an
experiment with respect to all the factors. In the experiment,
a wide range of values is considered, as follows:

1. Two levels of the failure rates of the memory packages,
namely the upper bounds and the lower bounds as shown in
Section 5.2.3.1

NAME     R    N    MTBF(P)   MTBF(I)   SPFM   DRT    SRT    RE(P)  RE(I)  DMRT   MUT        MRT    AVAIL
CU       1    1    13649     13649     0.000  1.00   0.00   0.000  0.000  0.10   6824.5     0.550  0.999919
CUM      2    2    9000000   -         0.000  0.25   0.00   0.000  0.000  0.10   NO EFFECT ON PERFORMANCE
TNC      1    1    321532    321532    0.000  0.50   0.00   0.000  0.000  0.10   160766.0   0.300  0.999998
FOT      1    1    64287     -         0.000  0.25   0.00   0.000  0.000  0.10   64287.0    0.250  0.999996
TN       1    1    5843      -         0.000  0.25   0.00   0.000  0.000  0.10   5843.0     0.250  0.999957
EM-C     521  521  2550138   2550138   0.000  1.00   0.00   0.000  0.000  0.10   2447.3     0.550  0.999775
EM-M     521  521  9000000   -         0.000  0.25   0.00   0.000  0.000  0.10   17274.5    0.250  0.999986
DBMC     1    1    76504     76504     0.000  1.00   0.00   0.000  0.000  0.10   38252.0    0.550  0.999986
DBM      512  512  9000000   -         0.000  0.25   0.00   0.000  0.000  0.10   17578.1    0.250  0.999986
PROC-1   128  129  75545     75545     0.005  1.00   0.25   1.000  1.000  0.10   50162.4    0.222  0.999996
PROC-2   128  129  75545     75545     0.005  1.00   0.25   1.000  1.000  0.10   50162.4    0.222  0.999996
PROC-3   128  129  75545     75545     0.005  1.00   0.25   1.000  1.000  0.10   50162.4    0.222  0.999996
PROC-4   128  129  75545     75545     0.005  1.00   0.25   1.000  1.000  0.10   50162.4    0.222  0.999996

FMP TOTAL =                                                                      1032.1     0.430  .9995854

LEGEND

R        Number of Devices Required to be Operating for Success
N        Number of Devices Available
MTBF(P)  Mean Time Between Failures - Permanent
MTBF(I)  Mean Time Between Failures - Intermittent
SPFM     Percentage of Failures that are Single Point Failures
DRT      Device Repair Time - Permanent Failures
SRT      Single Point Failure Repair Time - Permanent Failures
RE(P)    Recovery Efficiency - Permanent Failures
RE(I)    Recovery Efficiency - Intermittent Failures
DMRT     Device Manual Recovery Time


Figure 5-2. Print Output of the Upper Bounds of MUT, MRT and Availability of the FMP

NAME     R    N    MTBF(P)   MTBF(I)   SPFM   DRT    SRT    RE(P)  RE(I)  DMRT   MUT        MRT    AVAIL
CU       1    1    13649     1365      0.000  1.00   0.00   0.000  0.000  0.10   1240.9     0.182  0.99985
CUM      2    2    90909     -         0.000  0.25   0.00   0.000  0.000  0.10   45454.5    0.250  0.999995
FOT      1    1    64287     -         0.000  0.25   0.00   0.000  0.000  0.10   64287.0    0.250  0.999996
TNC      1    1    321532    32153     0.000  0.50   0.00   0.000  0.000  0.10   29230.0    0.136  0.999995
TN       1    1    5843      -         0.000  0.25   0.00   0.000  0.000  0.10   5843.0     0.250  0.999957
EM-C     521  521  2550138   255014    0.000  1.00   0.00   0.000  0.000  0.10   445.0      0.182  0.99959
EM-M     521  521  22727     -         0.000  0.25   0.00   0.000  0.000  0.10   43.6       0.250  0.994302
DBMC     1    1    76504     7651      0.000  1.00   0.00   0.000  0.000  0.10   6955.4     0.182  0.999974
DBM      512  512  5682      -         0.000  0.25   0.00   0.000  0.000  0.10   11.1       0.250  0.977969
PROC-1   128  129  33572     3357      0.005  1.00   0.25   0.000  0.000  0.10   23.7       0.100  0.99578
PROC-2   128  129  33572     3357      0.005  1.00   0.25   0.000  0.000  0.10   23.7       0.100  0.99578
PROC-3   128  129  33572     3357      0.005  1.00   0.25   0.000  0.000  0.10   23.7       0.100  0.99578
PROC-4   128  129  33572     3357      0.005  1.00   0.25   0.000  0.000  0.10   23.7       0.100  0.99578

FMP TOTAL =                                                                      3.5        0.160  .95548497


Figure 5-3. Print Output of the Lower Bounds of MUT and Availability of the FMP
2. Two levels of SECDED improvement factors, taking "two" as
the lower bound level, while the upper level corresponds
to the upper limit for the different memory packages stated in
Section 5.2.3.2

3. The ratio of the MTBF of permanent failures to the
MTBF of intermittent failures takes the values 1, 5 and 10.

4. The recovery efficiencies are chosen from 70% to 100%
in 10% increments.
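
The four factors listed above define a grid of cases. The sketch below simply enumerates that grid; it assumes a full factorial design (every level of every factor combined with every other), which the text does not state explicitly, and the names used are illustrative. It does not compute MUT, MDT or availability, which in the study came from Program DESIGN.

```python
from itertools import product

MEMORY_FAILURE_RATE_LEVELS = ("lower bound", "upper bound")
SECDED_FACTOR_LEVELS = ("lower bound (2)", "upper limit")
MTBF_RATIOS = (1, 5, 10)                      # MTBF(P) / MTBF(I)
RECOVERY_EFFICIENCIES = (0.70, 0.80, 0.90, 1.00)


def sensitivity_cases():
    """Return every combination of factor levels as a list of dictionaries."""
    return [
        {
            "memory_failure_rate": fr,
            "secded_factor": secded,
            "mtbf_ratio": ratio,
            "recovery_efficiency": re,
        }
        for fr, secded, ratio, re in product(
            MEMORY_FAILURE_RATE_LEVELS,
            SECDED_FACTOR_LEVELS,
            MTBF_RATIOS,
            RECOVERY_EFFICIENCIES,
        )
    ]


if __name__ == "__main__":
    cases = sensitivity_cases()
    print(len(cases), "cases; first case:", cases[0])
```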

The results are summarized in Table 5-6. From the results we
learn that the availability, which changes only from 96.13% to
99.96%, is not significantly affected by those factors. If the
memory packages are of a low reliability level and the SECDED
improvement factors are low, MUT and MDT are only slightly affected
by the other factors. On the other hand, if the memory packages are
highly reliable and the SECDED improvement factor is large, the MUT
is increased by 200% to 300% and the MDT is decreased by 25% to 30%
as the ratio between the MTBF for permanent failures (MTBF(P)) and
the MTBF for intermittent failures (MTBF(I)) changes from 1 to 5.
Under the same conditions the MUT increases very rapidly as the
recovery efficiency approaches 100%. Finally, it can be pointed out
that the MUT is significantly affected by the reliability of
the memory packages, as expected.

5.3 ERROR DETECTION AND CORRECTION 

5.3.1 Control Coverage

In the baseline system there are a number of mechanisms for error
detection and correction. These include error detection and
correction on all memories, with sufficiently powerful codes to
guarantee uncorrected error rates lower than a specified
requirement, and undetected error rates below an even lower
required rate.
The mechanisms fall into three classes. First, there are errors
for which immediate correction is done, even if there is a single
hard error in the machine. Error correction in memory is of this
class. Second, there are errors that are detected immediately when
they occur. Third, there is a repertoire of checks which is intended
to detect as many as possible of those errors not detected
immediately. For example, memory words are initialized to
"invalid". As long as a substantial amount of memory is in the
"invalid" state, there is a substantial chance of detecting a
memory addressing error because of the "invalid" word fetched in
response.

Table 5-7 shows the percentage of the total chips in the FMP that
are covered by each mode of error correction. There are
approximately ninety-eight thousand chips (49% of the machine)
that have error correction capabilities applied to them in the
baseline system. These are the memory chips. In addition there
are about twelve thousand additional chips that are involved in
data transfer paths of sufficient parallelism that the addition of
error-correcting check bits in parallel would represent a modest
(20% to 40%) increase in parts count. There are one hundred
eighteen thousand chips in the baseline system that have immediate
error detection. This includes all the memory chips plus the
transposition network, which has the EM error detection code on all
data passed through it, and parity on microcode ROMs. We could add
about nine thousand chips to this total by putting a modulo-3
check digit on all arithmetic units and adding parity or SECDED to
the parallel path from CU to processors. Additional chips would
be required by such additional error detection.
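
The modulo-3 check digit mentioned above is a residue check: the residue of a sum must equal the sum of the residues, modulo 3. The sketch below illustrates the principle for integer addition only. It is not the FMP arithmetic unit design; the function names and the fault-injection helper are illustrative assumptions. A single flipped bit changes the result by a power of two, which is never a multiple of 3, so every single-bit error in the sum is caught.

```python
def residue_check_passes(a: int, b: int, result: int) -> bool:
    """True when result is consistent with a + b under a modulo-3 residue check."""
    return result % 3 == (a % 3 + b % 3) % 3


def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit hardware fault in a computed result."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    a, b = 1234, 5678
    good = a + b
    print("correct sum passes:", residue_check_passes(a, b, good))
    undetected = [k for k in range(32) if residue_check_passes(a, b, flip_bit(good, k))]
    print("single-bit faults that escape detection:", undetected)
```

The check does not catch every multi-bit error: any error whose magnitude is a multiple of 3 leaves the residue unchanged.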
Table 5-7. Error Control Methods and Applicability


            Error Control Methods Available at Reasonable Redundancy              Error Control Methods Obscure
UNIT        No. Chips         Error Detection           Error Correction          No. Chips       Comments

PE          7k arith.         mod-3 check digit for     Retry on error(?)         34k non-arith.  (Note 2)
                              arith.; parity on
                              microcode.

PDM/PPM     38k mem.          yes (Note 1).             yes (Note 1).             14k control     Many errors will be
                              SECDED will work.         SECDED will work.                         address errors, also

TN          10k               EM's SECDED catches       Under investigation       -               -
                              hard errors (Note 1)

EM          31k mem.          SECDED or better if       SECDED or better if       16k control     Note 2
                              needed. Note 1            needed. Note 1

Fanout      2k in parallel    Can add parity            Can add SECDED at 25%     1k single       Note 2
            paths                                                                 signal

CU          10k mem.          Same as PDM               Same as PDM               3k              Random logic. Note 2

DC                                                                                1k              EC not used during
                                                                                                  user program

DBM         29k mem.          SECDED or stronger.       SECDED or stronger        2k control      Note 2
                              Note 1.                   code. Scrubbing of
                                                        errors. Note 1.

TOTAL       127k              127k chips have error     120k chips have error     71k             Dominated by PE logic
possible                      detectible at same        correctible even if                       and memory controls.
                              clock that error occurs   hard failure exists                       41% of NSS.

TOTAL       118k              118k chips have error     108k chips have error     80k             Dominated by PE logic
as per                        detectible at same        correctible even if                       and memory controls.
baseline                      clock that error occurs   hard failure exists                       45% of NSS.


Note 1. This error detection/correction is included in the baseline system as described in the final
report.


Note 2. Consistency checks, initialization to "invalid", confidence tests, etc. are designed
to forestall any error from going undetected for too long. Undetected transient failures
are the primary concern.




5.3.2 Improvements over Reference 1. 

Reference 1 lists a large number of reasonableness checks that
attempt to monitor the errors in that 40% to 44% of the FMP for
which direct error correction and error detection cannot be
implemented simply. These include tests for "invalid", the code
to which memory is initialized, as well as checks for illegal
opcodes and for memory addresses out of bounds, including bounds
checks on index calculations. Unnormalized numbers should never
be fetched for a floating point operation. The list goes on. All
of these are helpful. None, obviously, gives absolute protection.

Three items should be added to the design of reference 1 in the 
area of error detection and correction. These follow. 

5.3.2.1. On-line Processor Spares. An on-line spare processor is
extremely effective in eliminating repair time, or postponing
actual repair until convenient. Appendix C describes the
implementation in detail. One spare per cabinet is provided.

5.3.2.2. Error Detection, Error Correction in PDM, PPM, and CUM.
These memories, whose memory chips account for 19% of all the
circuit packages in the FMP, are to be provided with error
correction. The final report seems to have obscured this
requirement by laying stress on an error correction method which
quite possibly may not work. Likewise, error detection for
uncorrectible errors is to be provided. SECDED is being provided
in the baseline system, as of this report.

5.3.2.3. Error Correction in the Transposition Network. The
error correction code of the EM provides error detection against
hard failures in the transposition network and error correction
against single transient failures. This is included already in
the baseline system design, even though reference 1 failed to
emphasize it. It is possible to provide a TN design which
corrects for single hard errors in the TN, just as SECDED corrects
for single hard errors in memory. The best code for this purpose
has yet to be determined. One design adds three signals to the
already nine-wide TN path. Four Hamming check bits are applied to
the eight data bits in each byte. The OR of all twelve bits can
serve instead of the strobe, since all parities are odd. The
byte-correcting code is in effect concatenated with the SECDED
code used in EM, so no overall parity is needed for error
detection; the SECDED takes care of that.
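
One way to picture the "four Hamming check bits on eight data bits" design is the textbook Hamming(12,8) layout with odd parity, sketched below. This is an illustration only: the text states that the best code has yet to be determined, and the bit positions, function names and list representation here are assumptions. With odd parity every check group contains an odd number of ones, so no valid twelve-bit word is all zeros, which is the property that lets the OR of the twelve bits stand in for a strobe.

```python
CHECK_POSITIONS = (1, 2, 4, 8)                    # 1-based positions of check bits
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)      # 1-based positions of data bits


def _group_parity(word, p):
    """XOR of all bits whose 1-based position has bit p set."""
    parity = 0
    for pos in range(1, 13):
        if pos & p:
            parity ^= word[pos - 1]
    return parity


def encode(byte):
    """Return a 12-bit word (list of 0/1) carrying byte with odd-parity check bits."""
    word = [0] * 12
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos - 1] = (byte >> i) & 1
    for p in CHECK_POSITIONS:
        # The check bit is still 0 here, so this is the data parity of the group;
        # inverting it makes the whole group odd.
        word[p - 1] = _group_parity(word, p) ^ 1
    return word


def decode(word):
    """Correct at most one flipped bit and return the data byte."""
    word = list(word)
    syndrome = 0
    for p in CHECK_POSITIONS:
        if _group_parity(word, p) != 1:           # group parity should be odd
            syndrome |= p
    if syndrome > 12:
        raise ValueError("uncorrectable error pattern")
    if syndrome:
        word[syndrome - 1] ^= 1                   # the syndrome names the bad position
    return sum(word[pos - 1] << i for i, pos in enumerate(DATA_POSITIONS))


if __name__ == "__main__":
    sent = encode(0xA5)
    received = list(sent)
    received[6] ^= 1                              # single error at position 7
    print(hex(decode(received)), "recovered; OR of bits for zero byte:", int(any(encode(0))))
```

A code of this kind corrects single errors but miscorrects double errors, which is consistent with the text's remark that overall error detection is left to the SECDED code of the EM.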

5.3.3 Duplexed Computation 

For an almost 100% check on the computation, one can repeat the
user program, using a different set of 512 processors for the
second run. Using the processor switching of Appendix C, one can
run the problem first with the spare at the right end, and then
second with the spare at the left end. If the answers agree, the
answer is presumably free of any hardware error. Note that this
method is simpler, from a hardware implementation point of view,
than operating the processors in pairs which shadow each other,
but, like having pairs of processors do the same computation, it
also cuts the throughput in half.
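
The two-pass check can be sketched as below. This is a simplified
model, assuming a single row of 513 physical processors (512 plus one
spare) and a hypothetical fault model in which a bad processor always
returns a wrong value; run_array and duplexed_run are illustrative
names, not part of the FMP software.

```python
def run_array(program, inputs, physical_offset, faulty=frozenset()):
    """One pass: logical processor i runs on physical processor i + offset."""
    results = []
    for i, x in enumerate(inputs):
        value = program(x)
        if i + physical_offset in faulty:
            value += 1                    # hypothetical fault: a wrong answer
        results.append(value)
    return results


def duplexed_run(program, inputs, faulty=frozenset()):
    """Run twice on shifted processor sets and report whether answers agree."""
    first = run_array(program, inputs, 0, faulty)    # spare at the right end
    second = run_array(program, inputs, 1, faulty)   # spare at the left end
    return first, first == second
```

Since every logical processor lands on a different physical
processor in the second pass, a hard fault in any one processor
disturbs different elements of the two answers and the comparison
fails.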

5.3.4 Hard Error Tolerance


The habitual use of confidence and diagnostic checks, together
with all the above error detection, assures that a hard failure
cannot remain undetected for long in the FMP. Repair time is
essentially zero for failures in that 82% of the chips in the FMP
where either error correction allows the FMP to continue to run in
spite of the error, or processor switching switches in a spare
processor while the bad processor is removed and replaced at
leisure. For the remaining 18% of the components, repair is
needed before the FMP can continue to run. Thus, detection of
hard failure is more than adequately done and availability is
aided by having 82% of the failures associated with "zero" repair
time, or postponable repair.

5.3.5 Transients

60% of the packages, if involved in some transient error, will
produce effects that are immediately detected and usually
corrected, leaving 40% not covered. Obviously, it is better to
include tests that have some chance of detecting error than not to
have such tests. However, it is difficult to guarantee that all
transient errors will be caught before the run ends for 99.9% of
the runs. Even if we add mod-3 check digits in arithmetic, and
parity in the CU-to-processor fanout tree, 36% of the packages
remain in this category. The part of the machine where detection
of transient error is less than perfect consists of the memory
control and processor logic, primarily not the arithmetic portion
of the processor, but instruction decoding, register addressing,
shifting, and miscellaneous logic.

The main defense against transient error is, and always has been,
proper electrical and logic design. Wiring rules, noise budgets,
crosstalk calculations, maximum delay calculations, and so on, are
all part of the design.

CHAPTER 6

TRADEOFFS DELINEATED 


6.1 INTRODUCTION 


The design of the FMP will result from tradeoffs among a number of 
factors:

* Performance 

* Reliability 

* Availability 

* Programmability 

* Spectrum of Applications 

* Cost 

* Schedule 

* Risk 

The first four factors are explicitly mentioned in the statement 
of work for the extension to this study contract. The fifth, the 
spectrum of applications for which the FMP is to be designed, is
mentioned here as it has a direct bearing on the results of some
of the tradeoffs. For example, a scalar processor would probably
not be included if the applications were strictly limited to
aerodynamic flow and meteorological problems. Yet the scalar
processor will be necessary for some other applications and will 
interfere only slightly with the other desiderata. 

Programmability covers two distinct aspects. First, is the system 
one with which the compiler writer can successfully contend? 
Second, is the system presented to the user, including its 
FORTRAN, an easy one? 

Following are short discussions of specific issues where the
result is a trade between factors. In many cases, simulation
using test cases taken from the intended spectrum of applications
is the appropriate tool to resolve the tradeoffs.

6.2 LANGUAGE DEFINITION 

A part of the language definition in the extended FORTRAN to be
used for the FMP is an exercise of trading off throughput vs.
programmability. Proper language design finds some point where
almost the maximum throughput of the machine can be applied to the
desired spectrum of applications with little difficulty from
language restrictions or awkward constructs. That is, the language
restrictions necessary to ensure throughput do not interfere much
with one's ability to write programs for the selected set of
applications.

However, we note that programmability for all applications will 
interfere greatly with throughput, and that absolute maximum
throughput for all applications is likely to require a depth of 
analysis beyond that feasible in the compiler. 

6.3 MATCHING THE COMPILER AND THE INSTRUCTION SET

Hardware capabilities that are unused by the compiler are a waste 
of money and represent a flaw in the design. Capabilities in the 
language, that would be commonly and frequently used, for which 
the hardware provides no convenient way for the compiler to 
implement, result in awkward and inefficient code, and are also a 
flaw. However, the hardware, once specified, is not likely to 
have its instruction set expanded much during the life of the
machine, while the compiler presumably will continue to evolve
during that same period. Therefore, it is the capabilities of
that eventual hoped-for compiler, not the simplicity of the first
one, against which the instruction set is to be judged. An
example is the loading of PPM conditional on the "enable" bit. 

Our first compiler has no use for such a conditional capability. 
However, the capability costs almost nothing, since loading memory 
must be conditional on "enable" anyway, while the capability 
allows a type of concurrency between processors which we expect to 
be useful in the long run. 

6.4 WORD FORMAT

In reference 1, a word format of 1 bit sign, 8 bits exponent, and 
39 bits fraction part is suggested as ideal for the FMP. The BSP 
uses 1 bit sign, 11 bits exponent, and 36 bits fraction. The 
format with 7 bits exponent was determined as adequate for the
Navier-Stokes application. The BSP format was arrived at after
judging the precision and range requirements of a wide variety of
applications. Thus, the BSP word format is more likely to be
suitable for a wider variety of applications, some of which will
require the additional range on the exponent, while the
requirement of 10 decimal digits precision for the Navier-Stokes
equations will be satisfied with either format.

Therefore, for the purpose of being adaptable to a wider range of
applications, and not incidentally, for the additional purpose of
being format-compatible with an existing commercial product, it is
proposed to standardize on a word format containing 1 bit sign, 11
bits exponent, and 36 bits fraction part.
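
The precision claim can be checked with a short calculation. The
sketch below uses only the field widths quoted in this section; it
ignores any hidden-bit or normalization convention and the exponent
base, none of which is stated here, and the names FORMATS and
describe are illustrative.

```python
import math

# Field widths in bits, as quoted in the text above.
FORMATS = {
    "reference 1": {"sign": 1, "exponent": 8, "fraction": 39},
    "BSP": {"sign": 1, "exponent": 11, "fraction": 36},
}


def describe(fmt):
    """Return (word width, decimal digits of fraction, exponent codes)."""
    width = fmt["sign"] + fmt["exponent"] + fmt["fraction"]
    decimal_digits = fmt["fraction"] * math.log10(2)
    exponent_codes = 2 ** fmt["exponent"]
    return width, decimal_digits, exponent_codes


for name, fmt in FORMATS.items():
    width, digits, codes = describe(fmt)
    print(f"{name}: {width}-bit word, about {digits:.1f} decimal digits, "
          f"{codes} exponent codes")
```

Both formats fill a 48-bit word and both fractions exceed 10 decimal
digits, while the 11-bit exponent offers eight times as many
exponent codes as the 8-bit one.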

6.5 INSTRUCTION FORMATS


There is a well-known tradeoff between code file size and ease of 
decoding the individual instruction. For example, a full-length 
address field in the instruction allows the use of absolute 
addresses where appropriate, whereas if the instruction has a
short address field, it must always be with respect to some base 
address held in the hardware. 

In the present instance, a variation which we wish to test by 
simulation, during phase II, is the use of 32-bit and 16-bit 
instructions. The 16-bit instruction has room for only two 
register addresses; the 24-bit instruction contains three. 
Therefore the use of 16-bit formats will speed up instruction 
fetching while interfering with the optimization of the use of 
registers in the processor. According to one example tested, the 
instruction fetching is already faster than arithmetic execution, 
and 24-bit instructions will be preferred. 

6.6 SECDED 


Rigid requirements were set up for main memory in the FMP, 
consisting of PDM, PPM, and CUM. Less than one bit in 10^16 is to
be in error uncorrected, and less than one bit in 10 ^^ is to be
undetected. To satisfy these requirements, a single-error-
correction, double-error-detection code is proposed. However, at
this writing the actual error rates and failure mechanisms of the 
memory chips to be used are unknown. When these error rates and 
failure mechanisms become known, the SECDED should be reevaluated 
to make sure that it is neither too weak to cope with the error 
rates actually occurring, nor an overkill causing unnecessary 
cost. Since SECDED may permit the scheduling of repair while the 
system continues to run in degraded mode, it produces savings in 
maintenance cost while improving availability. The memory chips 
would have to be unbelievably reliable before SECDED did not pay 
for itself. 
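
The cost side of this tradeoff starts with the number of check bits.
The sketch below applies the standard Hamming bound plus one overall
parity bit; the 48-bit data width is an assumption taken from the
word format of section 6.4, and the function name is illustrative.

```python
def secded_check_bits(data_bits):
    """Check bits for single-error correction, double-error detection."""
    r = 1
    # Smallest r whose syndrome can name every bit position or "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1          # one extra overall parity bit detects double errors


word = 48                 # assumed data width, from the section 6.4 word format
checks = secded_check_bits(word)
print(f"{word} data bits need {checks} check bits "
      f"({100 * checks / word:.1f}% more memory chips)")
```

Under that assumption the memory grows by 7 bits per 48-bit word,
which is the storage cost against which the maintenance and
availability savings would be weighed.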



6.7 TRUSTWORTHINESS VS. THROUGHPUT 


In considering error correction and detection, we credit the FMP,
not with the total number of right answers it produces, but with 
the amount of answers that a rational user can use with 
confidence. One approach to trading off error correction and 
detection against raw throughput is to maximize this effective 
throughput. With no error correction at all, it is determined 
that most answers are probably wrong, and the effective throughput 
is practically zero, even though reams of so-called answers might 
be coming off the printer. With triple redundancy and voting on 
every element in the system, the throughput would be a fraction of 
the raw throughput with no error correction, but the answers would 
be very trustworthy. Somewhere between these extremes is an 
optimum. As explained in the last part of section five, the
existing baseline system design has sufficient error detection 
that there is little chance for a hard error to go undetected for 
long. A more severe problem for the FMP is the defense against 
transient errors. 

In the baseline system design described in reference 1, 54% of the 
packages in the system have single error correction, so that any 
single error produced in these packages is corrected during the 
run, which continues to produce correct answers. 11% of the
packages have immediate detection of any errors in them, so the
run terminates immediately if errors occur in them. The other 35%
of the packages are covered by a variety of error checks, which
are intended to eventually detect any errors. However, the
detection is indirect and not immediate, and some transient errors 
will remain undetected. 

If we apply additional error checks, throughput is reduced, but
trustworthiness of the results is improved. Figure 6-1 is an
oversimplified graphical representation of the effect. At some
reasonable amount of error control circuitry, the effective
throughput is maximized. Using f to represent the fraction of the
total hardware devoted to error control (assuming total hardware
remains constant), we can plot T0, the "raw" throughput, equal to
the number of inches in the pile of printout per hour, and T, the
effective throughput, which is the amount of useful answers
produced. T0 decreases with f. In fact, T0 decreases faster than
linearly with f, since (1-f) of the hardware is devoted to
producing useful output, and the fraction f that checks for errors
can only interfere. We can write:

T = (T0 x (1-f))/G(f)

The function G(f) can only increase with f, for any rational 
design. 

Finding the form of the function G(f) is probably not feasible.
What can be done, however, is to estimate the effect on the
detected and undetected error rates for any particular proposed
error detection/correction technique, together with its effect on
parts count or raw throughput. Each proposed error control
mechanism costs a certain percentage of the equipment, has a
certain throughput reduction associated with it, and catches some
percentage of otherwise uncaught errors.

As an example, consider the addition of a modulo 3 check digit to
arithmetic computation. Generating the check digit takes almost
as much additional logic as is already in the adders being
checked. Thus, adding 7% to the chip count of the machine catches
almost all errors occurring in what is now about 7% of the
machine. In addition, the 7% new packages create errors of their
own, which will usually be detected as arithmetic errors, so they
do not add to the undetected error rate, but do create false
alarms.
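
The modulo 3 check can be illustrated for integer addition. This is
a sketch of the general residue-check technique, not of the FMP
hardware; the fault_mask argument is a hypothetical way of modeling
a fault at the adder output, and checked_add is an illustrative
name.

```python
def residue3(x):
    """Check digit: the value modulo 3."""
    return x % 3


def checked_add(a, b, fault_mask=0):
    """Add two non-negative integers and verify the sum by its residue.

    fault_mask flips bits of the sum to model a hardware fault.
    """
    total = (a + b) ^ fault_mask
    expected = (residue3(a) + residue3(b)) % 3   # computed by the check logic
    return total, residue3(total) == expected
```

A single flipped bit changes the sum by a power of two, which is
never a multiple of 3, so every single-bit error is caught. An error
that changes the sum by a multiple of 3 escapes, which is why the
check catches almost all errors rather than all of them.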



Is a 7% false alarm rate added to the rate of detected error, a 7% 
increase in parts count and power, plus the throughput reduction 
due to the extra clocks used for checking, a fair price to pay for 
the X% decrease in the rate of undetected error? When the actual 
percentages are determined, perhaps the question can be answered. 

6.8 PARITY WITHIN PROCESSORS


Data transfers within the processor have been designed on the
expectation that the reliability and accuracy of digital
operations in logic circuits can be made as perfect as desired at
the design stage, using worst-case design. Whatever the error
requirements, careful design can ensure that the performance
exceeds them.

Parity checks on inter-register transfers could be implemented,
including transfer to the memory address registers. Such parity
checks will add about five chips to the processor logic for each
parity check required. Four parity checkers, or twenty chips, may
be needed. In addition, one clock, for the parity checking, will
be added to many operations, including most of the operations that
are now one clock long. Although no careful study of the
situation has yet been done, it is apparent that parity checking
internal to the processor will add 20% to the component count of
the PE, will add errors of its own, and will degrade raw
throughput significantly, while failing to check any of the
processor logic operations, only the transfers.

6.9 INSTRUCTION FETCHING MECHANISM 

In section two, the equipment description, a particular scheme for 
overlapping the execution of noninterfering instructions, and for 
doing some anticipatory instruction fetching was described. This 
scheme has not been validated in simulation to see how well it
works in real program streams as emitted by the compiler.
Simulation studies to determine how simple an instruction fetching 
and overlap mechanism we can have and still maintain throughput
would be desirable. Fortunately, most of the processor design 
details are independent of these decisions. 

6.10 LOADEM AND STOREM BLOCK FETCHING

The baseline system as described in Chapter Two of this report
omits from the LOADEM and STOREM instructions the ability to
stream N words out of each EM module in parallel for a total of
512N words per instruction. Initial work on hand-compiling from
FORTRAN source for the NSS indicates that almost all fetching from
EM is with N=1 (Example: SUBROUTINE TURBDA, see Ch. 3). If this
turns out to be true in general, the block fetching capability is
not worth the complexities it costs. Simulation, using test cases
taken from real code, with multiple-word fetches allowed and
disallowed, can be used to evaluate the effect on throughput. If
N greater than 1 is necessary, the following changes to the
baseline system of Chapter Two are seen:

* Rearrangement of data on DBM-EM transfers is required, as
described in the final report, so that, for N>1, data in
EM along the index in which streaming is taking place are
all found in the same EM module. Rearrangement is neither
needed nor desirable when N=1.

* The requirement for rearrangement of data disallows most
equivalencing on EM arrays, a restriction on normal FORTRAN
that need not be imposed if N=1.

* EM module design becomes more complicated. To keep up with
the TN streaming rate, the EM module is divided into two
submodules, as a side effect making the SECDED code less
effective. A need to increment the EM address per word
while streaming also adds complexity, especially since the
increment is a large integer, not unity.

* There is additional compiler complexity.

Enforcing the restriction that N must be 1 thus enhances
reliability and availability, while simplifying compiler and
operating system, and having an undetermined effect on throughput.

6.11 OVERLAPPABLE EM ACCESS 

A fourth instruction execution station could be added to the 
processor which would handle the EM access independently of the 
integer and floating point units at the expense of requiring two 
units contending for PDM, namely this EM unit, and the previously 
identified memory control. Having issued an EM fetch to this 
unit, no fetches from PDM would be allowed. 

The amount of increased overlap obtainable is dependent on the
compiler's being able to insert the EM fetches ahead of the place
where the data is required. In some of the loops in the benchmark
programs, this requires the insertion of the EM accesses for the
next iteration inside the current iteration. The question to be
answered by a tradeoff study is whether the increased compiler
complexity required to exploit such an addition to the design
produces enough increased throughput to be worth the difference.
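
The code transformation the compiler would have to perform can be
sketched as follows. The example models only the ordering of
operations, not timing; the list em stands in for extended memory,
and run_loop_prefetched is an illustrative name.

```python
def run_loop_prefetched(em, n, compute):
    """Loop in which the EM access for iteration i+1 is issued during i."""
    results = []
    pending = em[0] if n > 0 else None    # fetch for iteration 0, before the loop
    for i in range(n):
        current = pending
        if i + 1 < n:
            pending = em[i + 1]           # next iteration's EM access, issued early
        results.append(compute(current))  # arithmetic can overlap that access
    return results
```

The results are identical to those of the straightforward loop; the
added compiler work lies in proving that the early fetch does not
cross a store to the same EM location.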

6.12 SINGLE PROCESSOR MEMORY

Processor memory is separated into two separate memories for the
sake of increased throughput. Data fetching and instruction
fetching go on in parallel. Furthermore, no conflict resolution
between fetching program and data need be implemented. The
traditional way of getting interlace between two memory modules in
a single memory system is to make module number the least
significant bit of the address. This particular method would not
work in the processor, since data is fairly random, and program
steps, although sequential, are interspersed with data fetches and
stores. Thus, the two-memory design of the baseline system
achieves better interlacing than the traditional scheme. However,
it has the drawback that program and data memory are not
interchangeable; a program just over 8192 words cannot overflow
into data memory, and similarly for data.


An alternate design for the processor memory is as follows. Two
modules of 16384 words each are used to form a single homogeneous 
address space. Module number is the most significant bit. The 
compiler assigns all program addresses to the upper module and all 
data addresses to the lower module, except that, if either module 
is full, the other module can be used. 

The alternate design achieves just as good interlace of memory 
accesses as does the baseline system. When memory sizes are 
exceeded by either data or program but not by both together, the 
penalty is a slight slowdown, not an inability to run. Memory 
controls are slightly more complex, since program and data 
accesses will interfere whenever either overflows its normal half 
of the memory. 
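
The allocation rule of the alternate design can be sketched as
below. The function names and the returned dictionary are
illustrative; the sketch assumes word addressing with a 15-bit
address whose top bit selects the module.

```python
MODULE_WORDS = 16384                       # words per module, from the text


def module_of(address):
    """Module number is the most significant bit of the 15-bit address."""
    return (address >> 14) & 1


def place(program_words, data_words):
    """Split program and data between the upper and lower modules."""
    if program_words + data_words > 2 * MODULE_WORDS:
        raise ValueError("program and data together exceed the memory")
    program_upper = min(program_words, MODULE_WORDS)
    data_lower = min(data_words, MODULE_WORDS)
    return {
        "program_in_upper": program_upper,
        "program_in_lower": program_words - program_upper,   # overflow
        "data_in_lower": data_lower,
        "data_in_upper": data_words - data_lower,            # overflow
    }
```

For example, place(17000, 10000) puts 616 program words into the
lower module; only accesses to those words can interfere with data
fetches, which is the slight slowdown mentioned above.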

6.13 PROCESSOR PROGRAM MEMORY SIZE, CONTROL UNIT MEMORY SIZE

The processor program memory (8k words) was chosen to adequately
hold the aerodynamic flow model programs. Overlay of code from
CUM is easy and quick, and allows PPM to be smaller than the
entire code file. However, PPM should be large enough so that
overlay is not so frequent as to interfere with throughput. 

An overlay capability can be provided so that program can overlay 
into CUM from DBM, via a buffer area in EM. Since such overlay is
not needed for the flow model, it was not proposed as part of the 
initial capabilities of the operating system. 

For a different spectrum of applications, larger code files and 
different sequences of execution may be encountered. Hence, the 
code storage capabilities of the FMP may have to be reevaluated if
there is a change in the spectrum of applications. 

6.14 EXTENDED MEMORY SPEED, TRANSPOSITION NETWORK SPEED 

The baseline system extended memory is constructed of 64k-bit RAM
chips, operated at the fastest reasonable cycle time available at
the time the FMP is constructed. It was projected for the
baseline system that the cycle time would be on the order of 200
to 250 ns for the chip, and that therefore a cycle time for the EM 
module of 280 ns was appropriate. 

If the 64k-bit chip is in fact significantly faster than that, EM 
would be designed faster to match the chips. But, to go faster
than allowed by the 64k-bit chips will require the use of 16k-bit
RAM chips, a four-fold increase in memory chip count from 28,655 
chips to 114,620, a 43% increase in the chip count in the FMP and 
a distinctly adverse effect on availability and cost. 

The point to be determined by the tradeoff is whether the increase
in throughput from using 16k-bit chips is worth the extra cost,
additional failures, and extra power of using 16k-bit chips in the 
EM modules. 
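
The chip counts quoted for this tradeoff can be checked by simple
arithmetic. The total package count of the FMP is not stated in this
section; the value derived below is only what the rounded 43% figure
implies, and the function names are illustrative.

```python
def quadrupled(chips_64k):
    """A 16k-bit chip holds one quarter of a 64k-bit chip."""
    return 4 * chips_64k


def implied_total(chips_64k, increase_fraction):
    """Total package count implied by the quoted percentage increase."""
    added = quadrupled(chips_64k) - chips_64k
    return added / increase_fraction


print(quadrupled(28655))                     # memory chips with 16k-bit parts
print(round(implied_total(28655, 0.43)))     # rough implied size of the FMP
```

The first figure reproduces the 114,620 chips quoted above; the
second suggests a machine of roughly 200,000 packages, so the faster
memory would add on the order of 86,000 parts to be powered, cooled,
and repaired.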

The results of this tradeoff will be a function of how much 
computation is accomplished per fetch from extended memory, which 
is very dependent on the specified spectrum of applications. It 
was clear that for the aerodynamic flow problems, and almost
certainly for the meteorological problems also, the 64k-bit
chips will have more speed than needed. It also appears 
(according to the Electronic Times of November 7), that actual 
64k-bit chips will be faster than those postulated for the 
baseline system. Simulation, using inputs that represent the 
entire spread of intended applications, is the appropriate tool
for investigating this tradeoff.

The TN speed and design will have to be adjusted to match the EM 
speed. Thus, the revision in TN design will also have to be 
factored into the tradeoff. An EM made faster by using 16k-bit 
chips is partially self-defeating, since the wire lengths from EM 
to processor, now about 40 feet, will get significantly longer 
when the EM quadruples in physical size. 

6.15 CONTROL UNIT SPEED 

The speed of the control unit, including the implementation of 
specific instructions such as DIV 521, DIV 512, and MOD 521 that 
are needed for specific CU actions (in this case, calculating EM 
address and TN settings), is best determined by simulation using 
test cases that cover the entire spectrum of applications. A very 
fast MOD 521 instruction has been described by C. R. Vora in U.S.
patent 3,980,874. Since there is only one control unit in the 
entire array, the optimum CU design is clearly that one that
almost never interferes with throughput. On the other hand, a too
fast and hence unnecessarily complex CU will have adverse effects
on reliability and availability, and possibly will also make the 
compiler design more complex if some of the complexities require 
cooperation from the compiler to be effective. This optimum CU 
design is a function of the spectrum of applications. 
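
Why the CU needs MOD 521 and DIV 521 can be sketched under one
assumption: that EM words are spread over 521 modules, with the
module number given by the address modulo 521 and the word within
the module by the quotient. The instruction names suggest this, but
the mapping is not spelled out in this section, and the names
em_location and conflict_free are illustrative.

```python
EM_MODULES = 521      # assumed from the MOD 521 and DIV 521 instruction names
PROCESSORS = 512


def em_location(address):
    """Return (module number, word within module) for an EM address."""
    return address % EM_MODULES, address // EM_MODULES


def conflict_free(base, stride):
    """True if 512 strided accesses all fall in different EM modules."""
    modules = {em_location(base + i * stride)[0] for i in range(PROCESSORS)}
    return len(modules) == PROCESSORS
```

Because 521 is prime, any stride that is not a multiple of 521 sends
the 512 accesses to 512 different modules, including the
power-of-two strides that are common in array code. The price is a
division by a number that is not a power of two on every EM access,
which is why the speed of these instructions matters.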

6.16 SCALAR PROCESSOR 

6.16.1 Dependency on Spectrum of Applications 

The FMP has been described as an array of 512 processors and a 
control unit. The control unit concerns itself with
synchronization, some address calculation, and loop control. All floating
point arithmetic is done in the array. Aerodynamic flow models 
are well calculated on this machine. However, there are other 
applications, which do not have sufficient parallelism almost 
everywhere in the algorithm to be efficiently computed on this 
machine. If it is desired to broaden the spectrum of applications 
of the FMP, it is desirable, for some applications, to furnish a 
scalar processor to take over those portions of the floating-point 
calculation where most of the processors are idle waiting for a 
few to complete calculations. The term "scalar processor", as
used here, refers strictly to floating point scalar computations.
Loop control and other program execution control where a single 
decision controls the processing of the entire array has been 
accomplished, on other architectures, by the "scalar processor" 
portion of the equipment. These functions are included as an 
essential part of the control unit, and in so far as they are 
scalar, the control unit is a scalar processor, whether or not 
specific equipment for handling floating point scalars is 
supplied. 

An evaluation of which applications are going to require the
addition of a scalar processor for efficient mapping onto the FMP
has not been made. It is suspected that the meteorology
applications are like the aero flow models and will not require a
scalar processor. Whether a scalar processor is desirable, and
which of the several options mentioned below for including a
scalar processor in the design, is a function of the intended set
of applications, and can therefore be defined properly only when
NASA defines the amount and kind of extensibility of scope that is
desired for the FMP. The baseline system as described includes 
the third of the three design options below. 

6.16.2 Simple Scalar Processor. The simplest recipe for providing
a scalar processor capability in the FMP is simply to provide a 
faster, more powerful processor for processor number 0. The first 
processor is the one that will be assigned to vectors of length 
one, and which will be executing processor code when the compiler
can find no parallelism. Thus, without doing anything special to 
the compiler, we gain some scalar capability by simply making the 
first processor a faster one. During parallel swatches of code, 
this processor cooperates with the others, and the program does 
not know that it is different. Those swatches of code where 512 
processors are idle take much less time because the first 
processor has been made faster. When short swatches of scalar and 
vector code alternate, overlapping of scalar and vector operations 
occurs. 

6.16.3 Added Processor. The simple system does not give the
scalar processor any particular speedup for accessing EM. It does
not give the scalar processor any faster way of handling those
actions that require cooperation with the control unit. At the
expense of complicating the compiler, we can add scalar processor
hardware that is separately programmed, and which can subsume some
of the control unit functions for scalar processing.

Suppose we provide a separate and different processor, which has
its own access to extended memory, and which is designed to
execute a more nearly independent code stream than that of the 512
processors in the array. Figure 6-2 shows a block diagram of the
FMP with such a scalar processor represented. Language extensions
and programming methods for using such a capability will have to
be defined.

Extended memory is "core" for the FMP. The amount of accessing 
into extended memory by the scalar processor may be such that 
extended memory speed will be a bottleneck for those applications 
that make extensive use of the scalar processor capability. 

Hence, for some range of applications, a faster extended memory 
(and hence one with fewer bits per chip), must be provided. Using
16k-bit chips instead of 64k-bit chips, for more raw speed,
increases the count from 29,176 memory chips to 116,704 memory
chips, an increase of 44% of the package count of the entire NSS.

The added processor has LOADEM and STOREM instructions in its
instruction stream which do not require the cooperation of the CU,
but merely contend with it for access to the extended memory. The
synchronization between the added processor and the CU is thereby
reduced, while requiring the compiler to determine when
synchronization is required for correct execution of the program.
Scalar processing and vector processing on the same data must be
done in the correct order.

6.16.4 Enhanced Control Unit. It has been suggested that scalar
processor capability can be achieved by adding floating point
instructions to the control unit. This also may imply that the
control unit be speeded up from its no-scalar-processor design so
it has the free time to perform as a scalar processor. The
discussions about accessing EM apply to this option as well as 
they apply to the previous one. 

Figure 6-2. Added Scalar Processor

6.16.5 Recommendation. Simulation of various programs across the
entire spectrum of applications is recommended as a means of
determining which of the several recipes for providing a scalar
processor is to be adopted, if any. The budget for compiler
writing is also to be consulted, since the separate processor
requires additional decisions on the compiler's part, as well as
perhaps additional language extensions.

6.17 MARGINAL CHECKING 

A strategy for weeding out incipient failures in electronic
equipment is to vary some parameter up and down from its nominal
value, measure the margins, and determine when those margins are
deteriorating, and what the failure mode is at which they fail.

The parameter being varied can be supply voltage, clock frequency, 
temperature, or anything else that appears to affect operation. 

It has been determined that marginal checking is useless for
worst-case designed digital circuits. However, as noted in the
final report, LSI cannot be worst-case designed in the
conventional sense, and marginal checking may be valuable for
weeding out those low-margin LSI packages that have a higher than
normal transient error rate.

6.18 COMPONENT TECHNOLOGY 

The speed of any given system architecture is ultimately limited
by the performance of the circuits from which it is assembled. The
final component choice for the FMP will weigh carefully the
tradeoff of speed (and power) considerations against the risk and
cost. The initial procurement cost of a more advanced technology
providing more desirable performance is easily measured. It is
usually shown that the initial cost of more advanced circuits is
easily justified by overall system performance improvements
(thus reducing the cost per operation). However, the risk in
selecting a more advanced and higher performance circuit
may be considerable, with potential for affecting the
production of the system being built in a number of ways:

* The delivery may be slow due to low yields. 

* Failure rates may be higher than anticipated. 

* The performance characteristics of devices made in
production may be degraded from the original developmental
samples and design goals.

* Low usage may discourage development of second sources, and 
result in continued elevated prices. 

* Unforeseen application problems discovered only during 
system checkout could require redesign or retrofit. 


It would be very desirable from a system performance point of view 
to be able to use the fastest circuits possible. However, the 
possible risks that accompany this choice make it imperative that 
a very careful tradeoff analysis be conducted given the choice of 
a mature, slow (but adequate) speed technology and an advanced 
faster speed technology. 

6.19 EXPANSIBILITY

By expansibility we mean generalizability and expandability. The 
NASF design has many features allowing an upward compatible second 
copy, as well as features allowing the upgrading of the NASF 
itself. This section lists some of the areas in which 
expansibility is found. 

6.19.1 Address Sizes. The address sizes are uniformly larger than
the memories they address, allowing the memories to be replaced by 
larger ones. 

Data Base Memory holds 134 million words (2^27) and is
addressed by the control unit whose register size is 32 bits.

Extended memory holds 34 million words (just over 2^25) and is
addressed by processor (32-bit integers) and control unit (32
bits).

Control unit memory holds 32k words (2^15) and is addressed by
the control unit whose integers are 32 bits long. Care must
be exercised not to insert 16-bit address registers that
cannot be expanded.

Processor data memory holds 16k words (2^14) and has a 16-bit
address. A four-times expansion of PDM is thus permitted.

Processor program memory holds 8k words (2^13) and has a
16-bit address.

Upgrades by replacing the memories with larger ones are therefore 
very feasible. 
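
The expansion headroom quoted above follows directly from the address widths. The sketch below is only an illustration of that arithmetic; the table name MEMORIES and the function expansion_factor are hypothetical and do not appear in the design.

```python
# Word capacities and address widths as quoted in Section 6.19.1.
# MEMORIES and expansion_factor are illustrative names only.
MEMORIES = {
    "CUM": (32 * 1024, 32),  # control unit memory, 32-bit CU integers
    "PDM": (16 * 1024, 16),  # processor data memory, 16-bit address
    "PPM": (8 * 1024, 16),   # processor program memory, 16-bit address
}


def expansion_factor(words, address_bits):
    """Largest growth factor that still fits in the address field."""
    return (2 ** address_bits) // words


if __name__ == "__main__":
    for name, (words, bits) in MEMORIES.items():
        print(name, expansion_factor(words, bits))
```

For PDM this gives the four-times expansion stated in the text; for PPM the same arithmetic allows an eight-times expansion.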

6.19.2 Transfer Rates There are a number of options for
increasing the transfer rates between portions of the FMP. Many
of these are discussed in other paragraphs in this section, and
clearly, new transfer rates could be chosen for any new design,
depending on the results of tradeoff studies. As a retrofit, the
easiest area in which to increase transfer rates is the DBM-EM
transfers. This is fortunate, since if some virtual memory
scheme is implemented, this is the area of the baseline design that
may have to be improved. Each EM module has a one-word buffer, so
no EM changes at all are required for increased transfer rates,
just increased parallelism in the accessing of these buffers. The
DBM would have to be reconfigured for increased parallelism,
assuming that current projections about CCD shift rates are
correct.

6.19.3 Memory Size The address space allows increased memory
size. The need for increased memory size could arise from a
number of causes. CUM is required to hold enough program (both CU
and array processor program) to keep the array busy for a
reasonable amount of time between program overlays from DBM.
Thus, complex programs may require increased CUM size.

PDM size is the result of the requirement for temporary variables
and, sometimes, for buffering data fetched from EM. The required
PDM size is therefore applications-dependent. We believe that the
aerodynamic flow problem requires a larger-than-typical PDM, and
that larger PDM's are unlikely to be needed. However, the
expansion opportunity is there.

PPM, on the other hand, must hold enough program to keep the
processors busy for a reasonable time between overlays from CUM.
For problems like the aerodynamic model, where there is an inner
loop, this implies that at least the inner loop be contained
within the PPM. Overlay from CUM is fast, and this will allow
reasonable efficiency even when this is not true.

DBM, the window in the computational envelope, must be large
enough to hold results from the last job, space for the current
job, and the objects being assembled for the next job. If job
sizes are to grow, expandability of the DBM is a requirement.

6.19.4 Upgrades via Software Upgrading capability, by adding
features to the software, can be accomplished without any hardware
changes. The initial software is configured around the
aerodynamic flow model requirements. A number of features, not
required by the aerodynamic flow models, can be added to handle a
broader range of requirements, including:

* Windowing of data for executing jobs whose files exceed
the size of EM.

* Language extensions, including such things as subscripted
subscripts, linear recurrences on the parallel subscript,
and so on.

* Vectorizer, to analyze nonparallel FORTRAN and produce FMP
FORTRAN for operation on the parallel machine.

* Multiprogramming capability on the FMP. Proper implementation
of multiprogramming may require hardware additions
as well.

APPENDIX A


Preliminary Compiler Algorithms for Setting the Transposition 
Network 

Definition of the FORTRAN extensions and restrictions for the NASF 
requires rigorous definition of the algorithms for setting the 
SKIP and OFFSET of the transposition network and matching them 
closely to the FORTRAN constructs. 

The issues to be addressed in this memo are: 

1. Matching of FORTRAN DOPARALLEL to EM accessing. 

2. Requirements for multiple accessing within a DOPARALLEL 
construct. 

3. Optimization of accessing for single access types. 

As a preliminary step in addressing these issues a more complete 
definition of the DOPARALLEL statement needs to be formulated. 

The DOPARALLEL statement cannot be nested, since nesting invites
programmer error. Rather, the DOPARALLEL statement is
defined to have multiple increment sets, i.e.

DOPARALLEL J=J1,J2,J3; K=K1,K2,K3 ...

where J1 = initial value, most rapidly varying index
      J2 = final value, most rapidly varying index
      J3 = skip distance, most rapidly varying index
      K1 = initial value, next most rapidly varying index
      K2 = final value, next most rapidly varying index
      K3 = skip distance, next most rapidly varying index
      (...) ellipsis indicates further increment sets

ENDDO;ENDDO

1. Matching FORTRAN DOPARALLEL to Extended Memory Accessing

Since the entire set of multidimensional DOPARALLEL statements is
difficult to discuss, the specific example of three dimensional
accessing with a 2 dimensional DOPARALLEL and a single dimensional
inner loop will be described in detail. For this three
dimensional case there are 6 possible access patterns for any
given array, corresponding to the possible permutations of the
indices:

Case I      A(I,J,K)
Case II     A(K,I,J)
Case III    A(J,K,I)
Case IV     A(I,K,J)
Case V      A(J,I,K)
Case VI     A(K,J,I)

It is necessary for the compiler to determine the SKIP distance
and the OFFSET of the transposition network for any of these
accesses for the given DOPARALLEL construct, i.e.,

EMARRAY A(IFIRST, ISECOND, ITHIRD)
DOPARALLEL J=1,JLIM; K=1,KLIM
DO 1 I=1,ILIM
S(i) = Access Case (i)
1 CONTINUE
ENDDO; ENDDO

The equations for setting the Transposition Network (SKIP and
OFFSET) are given in Tables 1A through 1C. Table 1D provides a
table for determining index parameters. It is assumed, of course,
that the array has been laid out in memory in the FORTRAN sense
(first subscript varying most rapidly).

To clarify these equations a complete example is worked out in
detail in Figures 1-7. The chosen array, A(5,3,7), has extents
less than the number of memory modules (11) and processing
elements (10), in a manner similar to that of the NASF problems.
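
As an informal illustration of how the (J,K) index sets of a DOPARALLEL are spread over the processing elements, the sketch below enumerates the pairs with J varying most rapidly and hands N of them to the processors in each cycle. It assumes one (J,K) pair per processing element per cycle, which is a reading of the construct and not a statement of the compiler's actual algorithm; the names assign_pairs and cycle_count are hypothetical.

```python
def assign_pairs(jlim, klim, n):
    """Yield (cycle, pe, j, k) for each (J, K) pair, J varying fastest.

    Assumes one pair per processing element per cycle (an illustrative
    reading of the DOPARALLEL construct).
    """
    for index in range(jlim * klim):
        k, j = divmod(index, jlim)
        yield index // n + 1, index % n, j + 1, k + 1


def cycle_count(jlim, klim, n):
    """Number of cycles needed to cover all pairs: ceil(jlim*klim / n)."""
    return (jlim * klim + n - 1) // n


if __name__ == "__main__":
    # Array A(5,3,7) accessed as A(I,J,K): JLIM = 3, KLIM = 7, N = 10.
    for row in assign_pairs(3, 7, 10):
        print(row)
    print("cycles:", cycle_count(3, 7, 10))
```

With JLIM = 3, KLIM = 7 and 10 processors this gives 3 cycles, and the pair J = 2, K = 5 falls to processing element 3 in the second cycle.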



Equations for
Transposition Network OFFSET Calculations

Given Quantities

N       = Number of processors
M       = Number of memory modules
IA0     = Base address of array having index parameters I0, J0, K0
IFIRST  = extent of first parameter in array
ISECOND = extent of second parameter in array
ITHIRD  = extent of third parameter in array

Determined Quantities from Table 1C

ICLIM = Total number of cycles
IDEL  = Skip distance associated with I parameter
JDEL  = Skip distance associated with J parameter
KDEL  = Skip distance associated with K parameter
ILIM  = Array extent associated with I parameter
JLIM  = Array extent associated with J parameter
KLIM  = Array extent associated with K parameter

Defined Quantities (all divisions are FORTRAN integer divisions)

IC   = cycle number
NN   = subiteration number
K1   = (N*(IC-1))/(JLIM) + K0              = least rapidly varying index*
J1   = N*(IC-1) - (K1-K0)*JLIM + J0        = most rapidly varying index*
IA00 = IA0 + (J1-J0)*JDEL + (K1-K0)*KDEL

Transposition Setting SKIP distance = JDEL

*J1, K1 values for processing element 0,
 1st subiteration

Table 1A

OFFSET Calculation for Transposition Network (Subiteration = 1)
for given I value

IADD(IC,1)   = IA00 + (I-I0)*IDEL
               (address of first element to be fetched)
OFFSET(IC,1) = (IADD(IC,1)) MOD (M)

OFFSET Calculation for Transposition Network (all other subiterations*)
for given I value

IADD(IC,NN)   = IA0 + (I-I0)*IDEL + (K1-K0 + NN-1)*KDEL
                (address of first element to be fetched
                on this iteration)
IP(IC,NN)     = (NN-1)*JLIM - J1 + J0
                (processor that needs to obtain this first
                element on this iteration)
OFFSET(IC,NN) = (IADD(IC,NN) - IP(IC,NN)*JDEL) MOD (M)

*Subiterations NN = 2 through NX
 where NX = 2N+1+(JLIM-J1)+1

If (NN.EQ.NX).AND.(K(NN).EQ.KLIM), further subiterations do not need to be
performed. K(NN) is the K index value of the 1st element of the NNth
subiteration.

Table 1B

Parameter Assignments for Arbitrary Array Extents
and Number of Processors

CASE        ILIM     JLIM     KLIM     IDEL            JDEL            KDEL            ICLIM

1 (I,J,K)   IFIRST   ISECOND  ITHIRD   1               IFIRST          IFIRST*ISECOND  (ISECOND*ITHIRD+N-1)/N
2 (K,I,J)   ISECOND  ITHIRD   IFIRST   IFIRST          IFIRST*ISECOND  1               (IFIRST*ITHIRD+N-1)/N
3 (J,K,I)   ITHIRD   IFIRST   ISECOND  IFIRST*ISECOND  1               IFIRST          (IFIRST*ISECOND+N-1)/N
4 (I,K,J)   IFIRST   ITHIRD   ISECOND  1               IFIRST*ISECOND  IFIRST          (ISECOND*ITHIRD+N-1)/N
5 (J,I,K)   ISECOND  IFIRST   ITHIRD   IFIRST          1               IFIRST*ISECOND  (IFIRST*ITHIRD+N-1)/N
6 (K,J,I)   ITHIRD   ISECOND  IFIRST   IFIRST*ISECOND  IFIRST          1               (IFIRST*ISECOND+N-1)/N

EMARRAY A(IFIRST, ISECOND, ITHIRD)
Number of Processors = N
Table 1C
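
The parameter assignments of Table 1C all follow one rule: each loop index inherits the extent and the FORTRAN stride of the array dimension it subscripts. The sketch below expresses that rule compactly; it is a restatement for checking purposes only, and the function name case_parameters and the dictionary layout are hypothetical.

```python
def case_parameters(case, ifirst, isecond, ithird, n):
    """Return the Table 1C quantities for access case 1..6.

    position[case] gives, for the loop indices I, J and K in turn, the
    array dimension (0, 1 or 2) each one subscripts. For example case 2
    is A(K,I,J): I is the 2nd subscript, J the 3rd, K the 1st.
    """
    position = {
        1: (0, 1, 2),  # A(I,J,K)
        2: (1, 2, 0),  # A(K,I,J)
        3: (2, 0, 1),  # A(J,K,I)
        4: (0, 2, 1),  # A(I,K,J)
        5: (1, 0, 2),  # A(J,I,K)
        6: (2, 1, 0),  # A(K,J,I)
    }[case]
    extents = (ifirst, isecond, ithird)
    strides = (1, ifirst, ifirst * isecond)  # FORTRAN storage order
    ilim, jlim, klim = (extents[p] for p in position)
    idel, jdel, kdel = (strides[p] for p in position)
    iclim = (jlim * klim + n - 1) // n  # cycles needed for all (J,K) pairs
    return {"ILIM": ilim, "JLIM": jlim, "KLIM": klim,
            "IDEL": idel, "JDEL": jdel, "KDEL": kdel, "ICLIM": iclim}


if __name__ == "__main__":
    print(case_parameters(2, 5, 3, 7, 10))
```

For the example array A(5,3,7) with 10 processors, case 2 yields IDEL = 5, JDEL = 15, KDEL = 1, JLIM = 7 and ICLIM = 4.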


































Index Value Determination

TEMP = (IADD(IC,NN) - IA0) - (I-1)*IDEL

Case  TEMP  J                         K                         IVAL  JVAL  KVAL

1     NO    J                         K                         I     J     K
2     YES   TEMP/JDEL+1               (TEMP-(J-1)*JDEL)/KDEL+1  K     I     J
3     NO    J                         K                         J     K     I
4     YES   TEMP/JDEL+1               (TEMP-(J-1)*JDEL)/KDEL+1  I     K     J
5     YES   (TEMP-(K-1)*KDEL)/JDEL+1  TEMP/KDEL+1               J     I     K
6     YES   TEMP/JDEL+1               (TEMP-(J-1)*JDEL)/KDEL+1  K     J     I

Table 1D


Figure 1 details the memory layout, assuming an arbitrary starting
point for the first element. The remaining Figures show the six
possible cases.

Utilizing the equations of Table 1, one can determine all the
parameters and the SKIP and OFFSET for any case. For example,
taking CASE II (since it is more complex, with access A(K,I,J)), the
parameters are:

Given Quantities (Table 1A)

N=10    M=11    IA0=19    IFIRST=5    ISECOND=3    ITHIRD=7

Determined Quantities (Table 1C)

ICLIM = (IFIRST*ITHIRD+N-1)/N = (5*7+10-1)/10 = 4
IDEL=5    JDEL=15    KDEL=1
ILIM=3    JLIM=7     KLIM=5
I0, J0, K0 = 1

Assume that one wishes to determine the SKIP and OFFSET, and
subsequently the IVAL, JVAL and KVAL of the indices, for the second
cycle, second subiteration, inner loop index number 3, i.e.
transposition setting #12.

Defined Quantities (Table 1A)

IC=2    NN=2    I=3

Memory Layout for Array A(5,3,7)

Address
within                      Memory Modules
Memory     0    1    2    3    4    5    6    7    8    9   10

  11     337  437  537
  10     217  317  417  517  127  227  327  427  527  137  237
   9     126  226  326  426  526  136  236  336  436  536  117
   8     525  135  235  335  435  535  116  216  316  416  516
   7     434  534  115  215  315  415  515  125  225  325  425
   6     314  414  514  124  224  324  424  524  134  234  334
   5     223  323  423  523  133  233  333  433  533  114  214
   4     132  232  332  432  532  113  213  313  413  513  123
   3     531  112  212  312  412  512  122  222  322  422  522
   2     411  511  121  221  321  421  521  131  231  331  431
   1       X    X    X    X    X    X    X    X  111  211  311
   0       X    X    X    X    X    X    X    X    X    X    X

No. Memory Modules = 11
No. Processing Elements = 10
Absolute address A# = 19
Memory Module No. M# = 8 = (19) MOD 11
Address in Module A# = 1 = (19) DIV 11
Address of any element AE# = Address A(L1, L2, L3)
                           = A# + (L1-1) + 5x(L2-1) + 5x3(L3-1)

Figure 1
Case I

EMARRAY A(5,3,7)
DOPARALLEL J=1,3; K=1,7
DO 1 I = 1,5
S1 = A(I,J,K)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 5

Setting         Sub                             PE NUMBER
Number   Cycle  Iteration  OFFSET    0    1    2    3    4    5    6    7    8    9    ADD

   1       1        1         8    111  121  131  112  122  132  113  123  133  114     19
   2       1        1         9    211  221  231  212  222  232  213  223  233  214     20
   3       1        1        10    311  321  331  312  322  332  313  323  333  314     21
   4       1        1         0    411  421  431  412  422  432  413  423  433  414     22
   5       1        1         1    511  521  531  512  522  532  513  523  533  514     23
   6       2        1         3    124  134  115  125  135  116  126  136  117  127     69
   7       2        1         4    224  234  215  225  235  216  226  236  217  227     70
   8       2        1         5    324  334  315  325  335  316  326  336  317  327     71
   9       2        1         6    424  434  415  425  435  416  426  436  417  427     72
  10       2        1         7    524  534  515  525  535  516  526  536  517  527     73
  11       3        1         9    137                                                 119
  12       3        1        10    237                                                 120
  13       3        1         0    337                                                 121
  14       3        1         1    437                                                 122
  15       3        1         2    537                                                 123

Figure 2
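
The addresses of Figure 1, and with them the ADD and OFFSET columns of Figure 2, can be checked against the address formula given with Figure 1. The following is a minimal sketch under the example's assumptions (base address 19, extents 5 x 3 x 7, 11 memory modules); the function names address and placement are hypothetical.

```python
IA0 = 19             # absolute address of A(1,1,1) in the example
EXTENTS = (5, 3, 7)  # IFIRST, ISECOND, ITHIRD
M = 11               # number of memory modules


def address(l1, l2, l3, base=IA0, extents=EXTENTS):
    """Absolute address of A(l1,l2,l3), first subscript varying fastest."""
    e1, e2, _ = extents
    return base + (l1 - 1) + e1 * (l2 - 1) + e1 * e2 * (l3 - 1)


def placement(l1, l2, l3, modules=M):
    """Return (memory module number, address within that module)."""
    addr = address(l1, l2, l3)
    return addr % modules, addr // modules


if __name__ == "__main__":
    print(placement(1, 1, 1))  # first element of the array
    print(address(1, 2, 4), placement(1, 2, 4))
```

Element A(1,1,1) lands in module 8 at address 1 within the module, as in Figure 1. Element A(1,2,4), the first element fetched in the second cycle of Case I, has absolute address 69 and lies in module 3, which is the OFFSET of that setting.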
Case II

EMARRAY A(5,3,7)
DOPARALLEL J=1,7; K=1,5
DO 1 I = 1,3
S2 = A(K,I,J)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 15

Setting        Sub            PE Number                 Assigned
Number  Cycle  Iter  OFFSET   0 1 2 3 4 5 6 7 8 9   ADD   PE#


1 

1 

. 1 

B 



111 

112 

113 

114 

115 

116 

117 




19 

0 

2 

1 

OL 

mm 








\211 

1 212 

213 


7 

3 

1 

1 



121 

122 

123 

124 

125 

216 

217 




25 

0 

4 

1 

2 

■ 8 








|22]J 222 

223 , 

26 

■ 

5 

1 

1 


131 

■ 132 

133 

134 

135 

136 

137 




30 

■ 

6 

1 

2 









233 

232 

233 

31 _ 

■ 

7 

2 

1 


12141 215 

216 

217 







65 

■IK 

8 

2 

2 

5 

mm 




311 

312 

313 

314 

315 

316 

31 


• 9 

2 

1 

^ 4 


225 

226 

227 







70 


10 

2 

2 

10 





321 322 

323 

324 

325 

326 

26 


11 

2 

' 1 

9 

|234 

235 

236 

237 

Q 






75 


12 

2 

2 

' 4 





332 

333 

334 335 

336 

27 


13 

3 


1 












111 

0 

14 

' 3 


7 


411 

412 

413 

414 

415 

416 

417. 


22 

1 

15 

3 


2 









f 511 

512 

23 

8 

16 

3 


6 

327 










116 

0 

17 • 

3 

2 

1. 


421 

422 

423' 

424 

425 

426 

427 


21 


18 


3 

7 









l52l| 

522 

28 

■ 

.19 

3 

1 

0 

1337 










121 


20 

3 

2 

6 


431 

432 

433 

434 

435 

436 

437 . 


32 

B 

21 

La... 

3 

1 









l53ll 

53,2 

33 

B 

22- 

1 4 

1 

■1 

1 

513 

514 

515 

516 

517 






53 

0 

23 

4. 

1 


1 

532 

524 

-5 25 

5 26 

527 






58 

0 

•24 

I_J 

1 

■1 

1 

533 

534 

535 

536 

537 






63 

0 


Figure 3 
Case III

EMARRAY A(5,3,7)
DOPARALLEL J=1,5; K=1,3
DO 1 I = 1,7
S3 = A(J,K,I)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 1

Setting        Sub            PE Number                 Assigned
Number  Cycle  Iter  OFFSET   0 1 2 3 4 5 6 7 8 9   ADD   PE#

Figure 4
Case IV

EMARRAY A(5,3,7)
DOPARALLEL J=1,7; K=1,3
DO 1 I = 1,5
S4 = A(I,K,J)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 15

Figure 5
Case V

EMARRAY A(5,3,7)
DOPARALLEL J=1,5; K=1,7
DO 1 I = 1,3
S5 = A(J,I,K)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 1

Setting        Sub            PE Number                 Assigned
Number  Cycle  Iter  OFFSET   0 1 2 3 4 5 6 7 8 9   ADD   PE#


1 

■■ 


8 

111 

211 

311 

411 

511 





19 

0 

2 



7 





ill2| 212 

312 

412 

512 

34 

5 

3 



2 

121 

221 

321 

421 

5 21 





24 

0 

4 



1 





Il22 

222 

322 

422 

522 

39 

5 

5 



7 


231 

331 

431 

531 





44 

0 

6 


2 

6 






232 

332 

432 

532 

59 

5 

7 

2 

1 

5 
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213 

313 

413 

513 ^ 





49 

0 

8 

2 

2 

4 





Im 

214 

314 

414 

514 

69 

5 

9 

2 

1 

10 

123 

223 

323 

423 
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54 

0 

10 

2 

2 

9 





124 
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524 

69 

5 

11 

2 

1 

4 

133 
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0 
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Figure 6 
Case VI

EMARRAY A(5,3,7)
DOPARALLEL J=1,3; K=1,5
DO 1 I = 1,7
S6 = A(K,J,I)
1 CONTINUE
ENDDO
ENDDO

SKIP = JDEL = 5

Setting        Sub            PE Number                 Assigned
Number  Cycle  Iter  OFFSET   0 1 2 3 4 5 6 7 8 9   ADD   PE#

■ 1 
2 
■ 3 

4 
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12 
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29 
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41 
K1 = (10*1)/7 + 1 = 2

J1 = 10 - 1*7 + 1 = 4

SKIP = JDEL = 15

Using the OFFSET calculation equation for NN=2 in Table 1B, one
obtains

IADD(2,2) = 19 + (3-1)*5 + (2-1 + 2-1)*1
          = 19 + 10 + 2 = 31
IP(2,2)   = (2-1)*7 - 4 + 1 = 4

OFFSET(2,2) = (IADD(2,2) - IP(2,2)*15) MOD (11)
            = (31 - 4*15) MOD (11)
            = (-29) MOD (11) = 4

The negative value is brought into the range 0 to M-1 by adding
multiples of M, i.e. -29 + 33 = 4.

This OFFSET calculation may appear strange at first glance. Since
one wishes this element to be produced in processing element 4, one
needs to determine what the "virtual" address of the array element
would have been to put an element into processing element 0.
Mode bits for PE's #0, 1, 2, 3 will produce null fetches.

Having now determined the SKIP and the OFFSET, one may wish to
determine the specific indices of the element. This is done by
means of Table 1D.

TEMP = (31-19) - (3-1)*5
     = 12 - 10 = 2
J = 2/15 + 1 = 1
K = (2 - (1-1)*15)/1 + 1 = 3
A(IVAL, JVAL, KVAL) = A(K,I,J) = A(3,3,1)
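
The arithmetic of this worked example can be reproduced with a few lines of code. The sketch below transcribes the equations of Tables 1A, 1B and 1D for Case II only; it is an illustration under the example's parameters, not the compiler algorithm itself, and the names setting and case2_indices are hypothetical.

```python
def setting(ic, nn, i, *, n, m, ia0, idel, jdel, kdel, jlim,
            i0=1, j0=1, k0=1):
    """SKIP, IADD, IP and OFFSET for cycle ic, subiteration nn, index i."""
    k1 = (n * (ic - 1)) // jlim + k0            # Table 1A, integer division
    j1 = n * (ic - 1) - (k1 - k0) * jlim + j0
    if nn == 1:
        # First subiteration: address of the element for PE 0.
        iadd = ia0 + (j1 - j0) * jdel + (k1 - k0) * kdel + (i - i0) * idel
        ip = 0
    else:
        # Later subiterations: first element and the PE that receives it.
        iadd = ia0 + (i - i0) * idel + (k1 - k0 + nn - 1) * kdel
        ip = (nn - 1) * jlim - j1 + j0
    # Python's % already returns a value in 0..m-1 for negative operands.
    offset = (iadd - ip * jdel) % m
    return {"SKIP": jdel, "IADD": iadd, "IP": ip, "OFFSET": offset}


def case2_indices(iadd, i, *, ia0, idel, jdel, kdel):
    """(IVAL, JVAL, KVAL) for Case II, A(K,I,J), per Table 1D."""
    temp = (iadd - ia0) - (i - 1) * idel
    j = temp // jdel + 1
    k = (temp - (j - 1) * jdel) // kdel + 1
    return k, i, j


if __name__ == "__main__":
    example = dict(n=10, m=11, ia0=19, idel=5, jdel=15, kdel=1, jlim=7)
    result = setting(2, 2, 3, **example)
    print(result)
    print(case2_indices(result["IADD"], 3, ia0=19, idel=5, jdel=15, kdel=1))
```

For IC = 2, NN = 2, I = 3 this reproduces IADD = 31, IP = 4, OFFSET = 4 and the element A(3,3,1).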

In a similar fashion one can determine the SKIP and OFFSET for any
setting number for any of the six possible cases. Additionally,
Table II gives a listing of a computer program which performs
these computations.* Representative output is given in the appendix
for the set of cases listed below.

IA0   Mem Mod   #PEs   IFIRST   ISECOND   ITHIRD

 19      11      10       5        3         7
 19      11      10       9        5         6
 19      11      10       6        2         8
 27      13      11       6        2         8

2. Requirements for Multiple Accessing within DOPARALLEL
Construct

The compiler will recognize if a variety of access types occur
within a given DOPARALLEL and will modify the basic access
algorithm. For example, given

*Note this is a very preliminary algorithm and should not be considered
"proven" software in any sense.
Table II

$SET LIST
$RESET FREE
FILE  5=COMPILER/DATA,UNIT=DISK,RECORD=14,BLOCKING=30
FILE  6=FILE6,UNIT=PRINTER
C ********************************************************************
C
C     BURROUGHS CORPORATION
C     COMPILER ALGORITHM FOR DETERMINING
C     TRANSPOSITION NETWORK SETTINGS
C     OFFSET AND SKIP DISTANCE FOR 3-D ARRAYS
C
C ********************************************************************
      DIMENSION IADD(10,10),IP(10,10),ISET(10,10)
C ********************************************************************
C     INPUT VARIABLES
C     ITYPE  - ARRANGEMENT OF ARRAY INDICES
C              ITYPE = 1  (I,J,K)
C              ITYPE = 2  (K,I,J)
C              ITYPE = 3  (J,K,I)
C              ITYPE = 4  (I,K,J)
C              ITYPE = 5  (J,I,K)
C              ITYPE = 6  (K,J,I)
C     IA0    = BASE ADDRESS OF ARRAY A WITH INDICES I0,J0,K0
C     M      = NUMBER OF MEMORY MODULES
C     N      = NUMBER OF PROCESSING ELEMENTS
C     IFIRST = ARRAY EXTENT OF FIRST DIMENSION OF ARRAY
C     ISECND = ARRAY EXTENT OF SECOND DIMENSION OF ARRAY
C     ITHIRD = ARRAY EXTENT OF THIRD DIMENSION OF ARRAY
C     J0     = INDEX VALUE OF BASE ADDRESS
C     K0     = INDEX VALUE OF BASE ADDRESS
C ********************************************************************
      READ (5,100) ITYPE,IA0,M,N,IFIRST,ISECND,ITHIRD,K0,J0
      WRITE(6,111) ITYPE,IA0,M,N,IFIRST,ISECND,ITHIRD,K0,J0
C ********************************************************************
C     SET UP OF INITIAL PROBLEM PARAMETERS INDEPENDENT
C     OF ORDERING
C     IDEL,JDEL,KDEL = SKIP DISTANCES
C     ILIM,JLIM,KLIM = ARRAY EXTENTS
C     ICLIM          = NUMBER OF CYCLES
C ********************************************************************
      IF(ITYPE.EQ.1) GO TO 1
      IF(ITYPE.EQ.2) GO TO 2
      IF(ITYPE.EQ.3) GO TO 3
      IF(ITYPE.EQ.4) GO TO 4
      IF(ITYPE.EQ.5) GO TO 5
      IF(ITYPE.EQ.6) GO TO 6
      IF(ITYPE.LT.1) GO TO 7
      IF(ITYPE.GT.6) GO TO 7
    1 IDEL = 1
      JDEL = IFIRST
      KDEL = IFIRST*ISECND
      ICLIM= (ISECND*ITHIRD+N-1)/(N)
      JLIM = ISECND
      ILIM = IFIRST
      KLIM = ITHIRD
      GO TO 8
    2 IDEL = IFIRST
      JDEL = IFIRST*ISECND
      KDEL = 1
      ICLIM= (IFIRST*ITHIRD+N-1)/(N)
      JLIM = ITHIRD
      ILIM = ISECND
      KLIM = IFIRST
      GO TO 8
    3 IDEL = IFIRST*ISECND
      JDEL = 1
      KDEL = IFIRST
      ICLIM= (IFIRST*ISECND+N-1)/(N)
      JLIM = IFIRST
      ILIM = ITHIRD
      KLIM = ISECND
      GO TO 8
    4 IDEL = 1
      JDEL = IFIRST*ISECND
      KDEL = IFIRST
      ICLIM= (ISECND*ITHIRD+N-1)/(N)
      JLIM = ITHIRD
      ILIM = IFIRST
      KLIM = ISECND
      GO TO 8
    5 IDEL = IFIRST
      JDEL = 1
      KDEL = IFIRST*ISECND
      ICLIM= (IFIRST*ITHIRD+N-1)/(N)
      JLIM = IFIRST
      ILIM = ISECND
      KLIM = ITHIRD
      GO TO 8
    6 JDEL = IFIRST
      KDEL = 1
      IDEL = IFIRST*ISECND
      ICLIM= (IFIRST*ISECND+N-1)/(N)
      JLIM = ISECND
      ILIM = ITHIRD
      KLIM = IFIRST
      GO TO 8
    7 WRITE(6,101)
  101 FORMAT(2X,'YOU HAVE AN ERROR IN ITYPE')
      GO TO 80
    8 WRITE(6,112) IDEL,JDEL,KDEL,ICLIM,JLIM,ILIM
      WRITE(6,114)
C     NUM COUNTS THE SETTINGS PROCESSED
      NUM = 0
C ********************************************************************
C     START OF CYCLE LOOP
C ********************************************************************
      DO 10 IC = 1,ICLIM
      IVV= N*(IC-1)
      K=(IVV)/(JLIM) + K0
      K1= K
      J=(IVV)-(K-1)*(JLIM) + J0
      J1=J
      IA00= IA0+(J-1)*JDEL+(K-1)*KDEL
      WRITE(6,113) IC,IVV,J,K
C ********************************************************************
C     START OF INNERMOST LOOP INDEX I
C ********************************************************************
      DO 20 I = 1,ILIM
      IADD(IC,1) = IA00 + (I-1)*IDEL
      IP(IC,1) = 0
C ********************************************************************
C     SUBITERATION LOOPS
C ********************************************************************
      DO 30 NN= 1,N
      N2=NN-1
      IF (NN.EQ.1) GO TO 9
      IADD(IC,NN) = IA0+(I-1)*IDEL + (K1-1+N2)*KDEL
      IP(IC,NN) = N2*JLIM-J1+1
    9 CONTINUE
      WRITE(6,100) IC,NN,IADD(IC,NN),IP(IC,NN)
      IF(IP(IC,NN).GT.N-1) GO TO 20
      ISET(IC,NN)= (IADD(IC,NN)-IP(IC,NN)*JDEL)
C ********************************************************************
C     ADJUSTING ISET TO POSITIVE NUMBER
C ********************************************************************
      DO 40 KAP= 1,100
      IF(ISET(IC,NN).GE.0) GO TO 50
      ISET(IC,NN) = ISET(IC,NN) + M
   40 CONTINUE
      WRITE(6,100) IC,NN,ISET(IC,NN)
   50 ISET(IC,NN) = MOD(ISET(IC,NN),M)
C ********************************************************************
C     DETERMINATION OF INDEX VALUES
C     NOT REQUIRED FOR OFFSET AND SKIP DETERMINATIONS
C ********************************************************************
      IF(ITYPE.EQ.1) GO TO 201
      IF(ITYPE.EQ.2) GO TO 202
      IF(ITYPE.EQ.3) GO TO 203
      IF(ITYPE.EQ.4) GO TO 204
      IF(ITYPE.EQ.5) GO TO 205
      IF(ITYPE.EQ.6) GO TO 206
  201 IVAL=I
      JVAL=J
      KVAL=K
      GO TO 207
  202 TEMP = (IADD(IC,NN)-IA0 - (I-1)*IDEL)
      J=TEMP/JDEL + 1
      K= (TEMP -(J-1)*JDEL)/KDEL + 1
      IVAL=K
      JVAL=I
      KVAL=J
      GO TO 207
  203 IVAL=J
      JVAL=K
      KVAL=I
      GO TO 207
  204 TEMP = (IADD(IC,NN)-IA0 - (I-1)*IDEL)
      J=TEMP/JDEL + 1
      K= (TEMP -(J-1)*JDEL)/KDEL + 1
      IVAL=I
      KVAL=J
      JVAL=K
      GO TO 207
  205 TEMP = (IADD(IC,NN)-IA0 - (I-1)*IDEL)
      K=TEMP/KDEL + 1
      J=(TEMP-(K-1)*KDEL)/JDEL + 1
      IVAL=J
      JVAL=I
      KVAL=K
      GO TO 207
  206 TEMP=(IADD(IC,NN)- IA0 - (I-1)*IDEL)
      J=TEMP/JDEL + 1
      K= (TEMP -(J-1)*JDEL)/KDEL + 1
      IVAL=K
      JVAL=J
      KVAL=I
C ********************************************************************
C     END OF INDEX COMPUTATIONS
C ********************************************************************
  207 NUM = NUM + 1
      IF(NN.EQ.1) GO TO 51
      IF((IADD(IC,NN)-IADD(IC,1)).EQ.JDEL+JLIM*N2) GO TO 20
   51 WRITE(6,115) NUM,IC,NN,ISET(IC,NN),IVAL,JVAL,KVAL
      IF((ITYPE.EQ.1).OR.(ITYPE.EQ.3)) GO TO 20
      IF((IC+1).EQ.ICLIM) GO TO 30
      IF (K.EQ.KLIM) GO TO 20
   30 CONTINUE
   20 CONTINUE
   10 CONTINUE
  100 FORMAT(2X,12I5)
  111 FORMAT(5X,' ITYPE',' IA0',' MEMOD',' #PES',' IFIRST',
     C' ISECOND',' ITHIRD',' K0',' J0'//5X,4I5,1I7,1I9,3I7//)
  112 FORMAT(5X,' IDEL',' JDEL',' KDEL',' ICLIM',' JLIM',
     C' ILIM'/5X,6I6/)
  113 FORMAT(5X,'CYCLE','SUBITER',' J ',' K'/5X,4I6)
  114 FORMAT(5X,' NUM',' CYCLE',' SUBITER',' OFFSET',' IVAL',
     C' JVAL',' KVAL')
  115 FORMAT(5X,2I5,1I9,1I7,3X,3I6)
   80 CONTINUE
      END
DOPARALLEL J=1,JLIM; K=1,KLIM
DO 1 I=1,ILIM
S1 = A(I,K,J) * A(K,I,J)
1 CONTINUE
ENDDO; ENDDO

It is obviously required that, for a given J,K pair, a specific
processing element must receive both of them. If one considers the
previous example and determines the assigned processing element
for

Type I     A(3,2,5)    PE#3
Type II    A(2,3,5)    PE#1

But this is wrong. Both of these accesses must go to the same
processing element. The solution to this apparent dilemma is to
expand the array size at compile time by "squaring" it if one of
these types of access occurs anywhere in the program, i.e. given

the array A(5,3,7) with extents 5,3,7

one expands it to square by increasing all extents to the largest
one, i.e., 7, and accessing the array as though it were of size
A(7,7,7).
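
The effect of squaring can be illustrated with a small sketch. It assumes that the (J,K) pairs are dealt to the processing elements in order, J varying most rapidly, so that the processor for a pair depends on the J extent seen by that access type; this assumption reproduces the PE#3 and PE#1 assignments quoted above but is not taken from the compiler itself. The names assigned_pe and squared are hypothetical.

```python
def assigned_pe(j, k, jlim, n=10):
    """PE handling the pair (J, K) when J varies fastest over jlim values."""
    return ((k - 1) * jlim + (j - 1)) % n


def squared(extents):
    """Extents after expanding every dimension to the largest one."""
    side = max(extents)
    return (side,) * len(extents)


if __name__ == "__main__":
    # Before squaring, the J extent differs between access types.
    print(assigned_pe(2, 5, jlim=3))  # Type I,  A(3,2,5): J=2, K=5, JLIM=3
    print(assigned_pe(5, 2, jlim=7))  # Type II, A(2,3,5): K=2, J=5, JLIM=7
    # After squaring A(5,3,7) every access type sees the same J extent,
    # so a given (J, K) pair maps to one PE regardless of access type.
    jlim = squared((5, 3, 7))[0]
    print(jlim, assigned_pe(2, 3, jlim))
```

Once all extents are equal, the mapping from a (J,K) pair to a processing element no longer depends on which permutation of the indices is used for the access.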


This is demonstrated in detail in Figure 8A&B for all 6 accessing
patterns. The I index, the innermost, is not iterated for each
cycle. As is obvious, one obtains the correct J,K pair in each
processing element, as is required. The appendix contains the
examples listed below.

IA0   Mem Mod   #PEs   IFIRST   ISECOND   ITHIRD

 19      11      10       3        3         3
 19      11      10       5        5         5
 27      13      11       6        6         6
 19      11      10       7        7         7
Figures 8A and 8B: Specific Examples

3. Optimization of Accessing for Single Access Type

If a single type of access occurs within a DOPARALLEL construct and
is one of the less favorable ones, then the compiler will reverse
the order of the DOPARALLEL construct. Cases I and II are already
optimal. Cases IV and VI would be inverted, i.e., the construct
would be DOPARALLEL K=1,KLIM; J=1,JLIM.

Cases III and V would remain as written, with a warning to the
user.

Appendix A
Normal Accessing

ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0
    1    19      11     10        5         3        7    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   1      5     15       3      3      5

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL


ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0
    2    19      11     10        5         3        7    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   5     15      1       4      7      3

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL


ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0
    3    19      11     10        5         3        7    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
  15      1      5       2      5      7

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL


ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0
    4    19      11     10        5         3        7    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   1     15      5       3      7      5

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL


ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0
    5    19      11     10        5         3        7    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   5      1     15       4      5      3

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL

ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0   J0
    6    19      11     10        5         3        7    1    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
  15      5      1       2      3      7

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL

  1     1        1         8       1      1      1
  2     1        2         5       2      1      1
  3     1        3         2       3      1      1
  4     1        4        10       4      1      1
  5     1        1         1       1      1      2
  6     1        2         9       2      1      2
  7     1        3         6       3      1      2
  8     1        4         3       4      1      2
  9     1        1         5       1      1      3
 10     1        2         2       2      1      3
 11     1        3        10       3      1      3
 12     1        4         7       4      1      3
 13     1        1         9       1      1      4
 14     1        2         6       2      1      4
 15     1        3         3       3      1      4
 16     1        4         0       4      1      4
 17     1        1         2       1      1      5
 18     1        2        10       2      1      5
 19     1        3         7       3      1      5
 20     1        4         4       4      1      5
 21     1        1         6       1      1      6
 22     1        2         3       2      1      6
 23     1        3         0       3      1      6
 24     1        4         8       4      1      6
 25     1        1        10       1      1      7
 26     1        2         7       2      1      7
 27     1        3         4       3      1      7
 28     1        4         1       4      1      7
 29     2        1         5       4      2      1
 30     2        2         2       5      1      1
 31     2        1         9       4      2      2
 32     2        2         6       5      1      2
 33     2        1         2       4      2      3
 34     2        2        10       5      1      3
 35     2        1         6       4      2      4
 36     2        2         3       5      1      4
 37     2        1        10       4      2      5
 38     2        2         7       5      1      5
 39     2        1         3       4      2      6
 40     2        2         0       5      1      6
 41     2        1         7       4      2      7
 42     2        2         4       5      1      7
ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD
    1    19      11     10        9         5        6

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   1      9     45       3      5      9

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL

  1     1        1         8       1      1      1
  2     1        1         9       2      1      1
  3     1        1        10       3      1      1
  4     1        1         0       4      1      1
  5     1        1         1       5      1      1
  6     1        1         2       6      1      1
  7     1        1         3       7      1      1
  8     1        1         4       8      1      1
  9     1        1         5       9      1      1
 10     2        1        10       1      1      3
 11     2        1         0       2      1      3
 12     2        1         1       3      1      3
 13     2        1         2       4      1      3
 14     2        1         3       5      1      3
 15     2        1         4       6      1      3
 16     2        1         5       7      1      3
 17     2        1         6       8      1      3
 18     2        1         7       9      1      3
 19     3        1         1       1      1      5
 20     3        1         2       2      1      5
 21     3        1         3       3      1      5
 22     3        1         4       4      1      5
 23     3        1         5       5      1      5
 24     3        1         6       6      1      5
 25     3        1         7       7      1      5
 26     3        1         8       8      1      5
 27     3        1         9       9      1      5



ITYPE   IA0   MEMOD   #PES   IFIRST   ISECOND   ITHIRD   K0   J0
    2    19      11     10        9         5        6    1    1

IDEL   JDEL   KDEL   ICLIM   JLIM   ILIM
   9     45      1       6      6      5

NUM   CYCLE   SUBITER   OFFSET   IVAL   JVAL   KVAL

  1     1        1         8       1      1      1
  2     1        2         3       2      1      1
  3     1        1         6       1      2      1
  4     1        2         1       2      2      1
  5     1        1         4       1      3      1
  6     1        2        10       2      3      1
  7     1        1         2       1      4      1
  8     1        2         8       2      4      1
  9     1        1         0       1      5      1
 10     1        2         6       2      5      1
 11     2        1         2       2      1      5
 12     2        2         8       3      1      1
 13     2        3         3       4      1      1
 14     2        1         0       2      2      5
 15     2        2         6       3      2      1
 16     2        3         1       4      2      1
 17     2        1         9       2      3      5
 18     2        2         4       3      3      1
 19     2        3        10       4      3      1
 20     2        1         7       2      4      5
 21     2        2         2       3      4      1
 22     2        3         8       4      4      1
 23     2        1         5       2      5      5
 24     2        2         0       3      5      1
 25     2        3         6       4      5      1
 26     3        1         2       4      1      3
 27     3        2         8       5      1      1
 28     3        1         0       4      2      3
 29     3        2         6       5      2      1
 30     3        1         9       4      3      3
 31     3        2         4       5      3      1
 32     3        1         7       4      4      3
 33     3        2         2       5      4      1
 34     3        1         5       4      5      3
 35     3        2         0       5      5      1
 36     4        1         2       6      1      1
 37     4        2         8       7      1      1
 38     4        1         0       6      2      1
 39     4        2         6       7      2      1
 40     4        1         9       6      3      1
 41     4        2         4       7      3      1
 42     4        1         7       6      4      1
 43     4        2         2       7      4      1
 44     4        1         5       6      5      1
 45     4        2         0       7      5      1
 46     5        1         7       7      1      5
 47     5        2         2       8      1      1
 48     5        3         8       9      1      1
 49     5        1         5       7      2      5
 50     5        2         0       8      2      1
 51     5        3         6       9      2      1
 52     5        1         3       7      3      5
 53     5        2         9       8      3      1
 54     5        3         4       9      3      1
 55     5        1         1       7      4      5
 56     5        2         7       8      4      1
 57     5        3         2       9      4      1
 58     5        1        10       7      5      5
 59     5        2         5       8      5      1
 60     5        3         0       9      5      1
 61     6        1         7       9      1      3
 62     6        1         5       9      2      3
 63     6        1         3       9      3      3
 64     6        1         1       9      4      3
 65     6        1        10       9      5      3
ITYPE   IA0   MEMOD   #PES   IFIRST
    3    19      11     10        9

IDEL   JDEL   KDEL   ICLIM   JLIM
  45      1      9       5      9

NUM   CYCLE   SUBITER   OFFSET

  1     1        1         8
  2     1        1         9
  3     1        1        10
  4     1        1         0
  5     1        1         1
  6     1        1         2
  7     2        1         7
  8     2        1         8
  9
APPENDIX B


SECDED RELIABILITY IMPROVEMENT MODELS

B.1 INTRODUCTION

The reliability of a computing system can be significantly improved by employing
single-bit error correction and double-bit error detection (SECDED), which the
FMP therefore uses to increase its reliability.

This appendix presents a model for assessing the reliability improvement of a
module operated with SECDED. The model can easily be embedded in the system
reliability prediction model. The final result is given as a mathematical
expression, and the bounds of the reliability and of the improvement factor are
studied. A FORTRAN program that computes the factor in double precision has
also been developed and validated.

B.2 MODEL

There are n chips in a module, and each chip has m bits. A word of n bits is
stored in the module by assigning each bit to a different chip. Without SECDED,
a bit failure causes the chip to fail, and the module with it. Assuming that the
time to failure of a bit is exponentially distributed, the times to failure of a
chip and of a module are also exponentially distributed.
In some cases, a hard bit failure causes the whole chip to fail; we call this a
catastrophic bit failure, and it occurs with probability (1-S). Otherwise the
failure is a non-catastrophic bit failure, which occurs with probability S.
Taking the MTBF of a chip as the time unit, the MTBF of a bit is m time units
and the bit failure rate is 1/m. The MTBF and the failure rate of a module are
1/n and n, respectively. The expected time between the (i-1)th and ith bit
failures, the expected time to the ith bit failure, the probability of no
two-bit failure in one word, and the probability of a two-bit failure in one
word are stated in Table B-1. The module fails at the ith bit failure only when
there has been neither a catastrophic failure nor a two-bit failure in the same
word before the ith bit failure, but a catastrophic failure or a two-bit failure
in the same word occurs at the ith bit failure. Since the transient and
catastrophic failures of a module at the ith bit failure are mutually exclusive,
the MTBF of a module with SECDED is given by
From the above expression, the reliability improvement factor is n times the
MTBF of the module with SECDED. When S=1, we obtain the upper bound of the
factor and of the MTBF; when S=0, we obtain the lower bound. If m is large
enough, the lower bound of the factor is 2.


If the expected time between the (i-1)th and ith bit failures is fixed at 1/n
time units, the expected time to the ith bit failure is i/n. The MTBF of a
module with SECDED is then given by:


MTBF (1 - S) 
Similarly, S=1 and S=0 give the upper and lower bounds of the factor and of the
MTBF, respectively. When m is large enough, the difference between the MTBFs of
the two models is negligible, and so is the difference between the factors. The
program for computing the reliability improvement factor is given in
Table B-2.

      SUBROUTINE SECFAC(M,N,S,FAC)
C
C     PROG. COMPUTES THE SECDED RELIABILITY IMPROVEMENT FACTOR.
C
C     M   : # OF BITS IN A CHIP. (INPUT)
C     N   : # OF CHIPS IN A MODULE, ALWAYS EQUAL # OF BITS IN A WORD.
C           (INPUT)
C     S   : THE PROB. OF A NON-CATASTROPHIC BIT FAILURE. (INPUT)
C     FAC : SECDED RELIABILITY IMPROVEMENT FACTOR, DEFINED AS MTBF OF
C           THE MODULE WITH SECDED DIVIDED BY MTBF OF THE MODULE
C           WITHOUT SECDED. (OUTPUT)
C
      IMPLICIT REAL*8(A-H,O-Z)
      FM=FLOAT(M)
      FN=FLOAT(N)
      FMTFN=FM*FN
      SEDM=1./FN
      I=1
      PI=1.
C     COMPUTING EXPECTED MTBF GIVEN THE 1ST BIT FAILURE CATASTROPHIC
      SUM=(1.-S)*(SEDM+1./(FN-1.))
      SIM1=1.
      S1=SEDM
      DO 100 I=2,M+1
      SIM1=SIM1*S
C     STOP COMPUTING NEGLIGIBLE QUANTITY
      IF(SIM1.LE.1.E-10) GO TO 200
      FI=FLOAT(I)
      FIM1=FLOAT(I-1)
C     COMPUTING PROB. OF NO 2-BIT FAILURE IN A WORD
      PI=PI*(FM-(FIM1-1.))*FN/(FMTFN-(FIM1-1.))
C     COMPUTING EXPECTED TIME UP TO THE ITH BIT FAILURE
      S1=S1+FM/(FMTFN-FIM1)
C     COMPUTING PROB. OF THE MODULE FAILURE AT THE ITH BIT FAILURE
      S3=1.-S+S*(FIM1*FN)/(FMTFN-FIM1)
      FPROB=SIM1*PI*S3
C     STOP COMPUTING NEGLIGIBLE QUANTITY
      IF(FPROB.LE.1.E-20) GO TO 200
C     COMPUTING EXPECTED MTBF GIVEN THE MODULE FAILED AT OR BEFORE
C     THE ITH BIT FAILURE
      SUM=SUM+S1*FPROB
  100 CONTINUE
  200 FAC=SUM*FN
      RETURN
      END
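
As a cross-check on the listing of Table B-2, the same recurrence can be sketched in Python. This is an illustrative port, not part of the original program: the function name secded_factor and its variable names are assumptions, and the two statements that are hardest to read in the listing are taken to be SUM=(1.-S)*(SEDM+1./(FN-1.)) and S3=1.-S+S*(FIM1*FN)/(FMTFN-FIM1). Under that reading, S=0 gives a factor of 1 + n/(n-1), which is close to the lower bound of 2 quoted in B.2.

```python
def secded_factor(m, n, s):
    """Illustrative port of SECFAC (names are assumptions, not the report's).

    m: bits per chip, n: chips per module (= bits per word),
    s: probability that a bit failure is non-catastrophic.
    Returns MTBF with SECDED divided by MTBF without SECDED.
    """
    fm, fn = float(m), float(n)
    total_bits = fm * fn
    t_first = 1.0 / fn                      # expected time to the 1st bit failure
    # Contribution when the 1st bit failure is catastrophic (SUM in the listing).
    mtbf = (1.0 - s) * (t_first + 1.0 / (fn - 1.0))
    p_noncat = 1.0                          # S**(i-1)      (SIM1)
    p_distinct = 1.0                        # no 2-bit failure in a word so far (PI)
    t_i = t_first                           # expected time to the ith failure (S1)
    for i in range(2, m + 2):
        p_noncat *= s
        if p_noncat <= 1e-10:               # stop computing negligible quantity
            break
        k = float(i - 1)
        p_distinct *= (fm - (k - 1.0)) * fn / (total_bits - (k - 1.0))
        t_i += fm / (total_bits - k)
        p_fail = 1.0 - s + s * (k * fn) / (total_bits - k)   # S3
        contrib = p_noncat * p_distinct * p_fail             # FPROB
        if contrib <= 1e-20:                # stop computing negligible quantity
            break
        mtbf += t_i * contrib
    return mtbf * fn


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(s, secded_factor(4096, 40, s))
```

The chip size of 4096 bits and word length of 40 bits in the usage lines are arbitrary illustration values, not parameters taken from the report.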



APPENDIX C 


SPARE PROCESSOR 


INTRODUCTION 

In Chapter 5, the reliability and availability calculations make use of the
switching of spare processors. This appendix presents the method of
switching in more detail, to support the claims made in Chapter 5. First,
the hardware that must be added to support the switching is discussed,
and second, the implications for processor numbering are given.

SWITCHING 

Figure C-1 shows a switching network, which amounts to one additional level
of logic at the processor side of the transposition network. This network
therefore increases the depth of the transposition network from ten levels of
logic to eleven. Switching is electronic, under software control. The
spare processor can be at any location from processor 0 to processor 128
in the cabinet. Figure C-1 shows the first cabinet; the others are similar.

No switching is needed in the connections to and from the control unit (CU). All
outputs from the control unit to processors are broadcast to all processors;
the inputs from processors to the CU are either ANDed together with a 512-way
AND, or ORed together with a 512-way OR, in the fanout boards. The fanout
board needs appropriate input from the spare processor to form the correct
512-way result. For example, in forming "all processors ready", or in
forming "any processor enabled", the correct result will be achieved by
having the spare processor's "enable" bit in the FALSE state.


Figure C-1. Switching of Spare Processor in One Cabinet
(figure labels: BELOW SPARE, ABOVE SPARE)
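
The effect of holding the spare's "enable" bit FALSE can be sketched as follows. This is a toy model only: the function name fanout_signals is hypothetical, and the rule that a disabled processor does not hold up "all processors ready" is an assumption made for illustration, not something the report states.

```python
def fanout_signals(enabled, ready):
    """Toy model of the fanout-board reductions (assumed semantics).

    enabled, ready: one bool per physical processor slot, spare included.
    ASSUMPTION: a disabled processor is ignored by the "all ready" AND.
    """
    any_enabled = any(enabled)                                   # wide OR
    all_ready = all(r or not e for e, r in zip(enabled, ready))  # wide AND
    return any_enabled, all_ready


if __name__ == "__main__":
    slots, spare = 129, 57          # one cabinet, spare slot chosen arbitrarily
    enabled = [i != spare for i in range(slots)]
    ready = [i != spare for i in range(slots)]   # the spare is idle, never "ready"
    print(fanout_signals(enabled, ready))
```

Under these assumptions the result depends only on the 128 switched-in processors of the cabinet, wherever the spare sits.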

PROCESSOR NUMBER 

The PNO instruction produces processor numbers from 0 through 511 in the
512 processors that are switched into the system, independently of which ones
are spare. Each processor in the cabinet has wired into its backplane a
number from 0 through 128. Each cabinet has a number (0, 1, 2, or 3) set by
a switch at the cabinet fanout board. If it were not for the spare processor,
the cabinet number would be concatenated with the hard-wired number in the
backplane to form the processor number. As it is, processors above the spare
processor subtract 1 from their hard-wired number before concatenating it with
the cabinet number to form the programmatic processor number as part of the
PNO instruction. Thus, there are ten poles on each switch shown in Figure C-1:
the eight data lines plus one strobe make nine poles for transposition network
use, and the tenth carries the bit that the PNO instruction uses in calculating
the processor number.
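
The numbering rule can be sketched in a few lines of Python. The function name processor_number and its argument names are hypothetical; the sketch assumes four cabinets of 129 slots each, with exactly one spare per cabinet, as described above.

```python
def processor_number(cabinet, slot, spare_slot):
    """Programmatic processor number for a hard-wired (cabinet, slot) pair.

    cabinet: 0..3, set by the switch on the cabinet fanout board.
    slot: 0..128, the number wired into the backplane.
    spare_slot: the slot currently switched out as the spare.
    Returns None for the spare itself, which has no processor number.
    """
    if not 0 <= cabinet <= 3:
        raise ValueError("cabinet must be 0..3")
    if not (0 <= slot <= 128 and 0 <= spare_slot <= 128):
        raise ValueError("slot numbers must be 0..128")
    if slot == spare_slot:
        return None
    # Processors above the spare subtract 1 from their hard-wired number.
    local = slot - 1 if slot > spare_slot else slot
    # Concatenate the 2-bit cabinet number with the 7-bit local number.
    return (cabinet << 7) | local


if __name__ == "__main__":
    print(processor_number(1, 58, 57))   # slot just above the spare in cabinet 1
```

Whatever slot is spare in each cabinet, the 512 remaining processors receive the numbers 0 through 511 with no gaps.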

SETTING THE SPARE PROCESSOR SWITCH 

The spare processor switch is set only at a time when the array has halted.
Switching is controlled from the diagnostic controller in response to
commands from the host. Hence, the FMP programs are never aware of
which processor is spare, and, as explained above, the FMP programs will
always have an FMP of 512 processors, numbered from 0 through 511, on
which to run.
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